Abstract

This paper describes computationally efficient approaches and associated
theoretical performance guarantees for the detection of known targets and
anomalies from few projection measurements of the underlying signals. The
proposed approaches accommodate signals of different
strengths contaminated by a colored Gaussian background, and perform 
detection without reconstructing the underlying signals from the obser- 
vations. The theoretical performance bounds of the target detector high- 
light fundamental tradeoffs among the number of measurements collected, 
amount of background signal present, signal-to-noise ratio, and similarity 
among potential targets coming from a known dictionary. The anomaly 
detector is designed to control the number of false discoveries. The pro- 
posed approach does not depend on a known sparse representation of tar- 
gets; rather, the theoretical performance bounds exploit the structure of a 
known dictionary of targets and the distance preservation property of the 
measurement matrix. Simulation experiments illustrate the practicality 
and effectiveness of the proposed approaches. 
1 Introduction



The theory of compressive sensing (CS) has shown that it is possible to accu- 
rately reconstruct a sparse signal from few (relative to the signal dimension) 
projection measurements [9, 15]. Though such a reconstruction is crucial to vi- 
sually inspect the signal, there are many instances where one is solely interested 
in identifying whether the underlying signal is one of several possible signals 
of interest. In such situations, a complete reconstruction is computationally 
expensive and does not optimize the correct performance metric. Recently, CS 
ideas have been exploited in [12, 16,21] to perform target detection and classi- 
fication from projection measurements, without reconstructing the underlying 
signal of interest. In [12,21], the authors propose nearest-neighbor based
methods to classify a signal $f \in \mathbb{R}^N$ to one of $m$ known signals given
projection measurements of the form $y = Af + n \in \mathbb{R}^K$ for $K < N$, where
$A \in \mathbb{R}^{K \times N}$ is a known projection operator and
$n \sim \mathcal{N}(0, \sigma^2 I)$ is the additive Gaussian noise.
This model is simple to analyze, but is impractical, since in reality, a signal is 
always corrupted by some kind of interference or background noise. Extension 
of the methods in [12,21] to handle background noise is nontrivial. Though [16] 
provides a way to account for background contamination, it makes a strong as- 
sumption that the signal of interest and the background are sparse in bases that 
are incoherent. This might not always be true in many applications. Recent 
works on CS [2,3] allow for the input signal $f$ to be corrupted by some
pre-measurement noise $b \sim \mathcal{N}(0, \sigma_b^2 I)$ such that one observes $y = A(f + b) + n$, and
study reconstruction performance as a function of the number of measurements, 
pre- and post-measurement noise statistics and the dimension of the input sig- 
nal. In this work, however, we are interested in performing target detection 
without an intermediate reconstruction step. Furthermore, the increased utility 
of high-dimensional imaging techniques such as spectral imaging or videogra- 
phy in applications like remote sensing, biomedical imaging and astronomical 
imaging [20,26,35,38-40,47,58] necessitates the extension of compressive target
detection ideas to such imaging modalities to achieve reliable target detection 
from fewer measurements relative to the ambient signal dimensions. 

For example, recent advances in compressive sensing (CS) have led to the 
development of new spectral imaging platforms which attempt to address chal- 
lenges in conventional imaging platforms related to system size, resolution, and 
noise by acquiring fewer compressive measurements than spatiospectral vox- 
els [8,14,18,50,53,57]. However, these system designs have a number of degrees 
of freedom which influence subsequent data analysis. For instance, the single- 
shot compressive spectral imager discussed in [18] collects one coded projec- 
tion of each spectrum in the scene. One projection per spectrum is sufficient 
for reconstructing spatially homogeneous spectral images, since projections of 
neighboring locations can be combined to infer each spectrum. Significantly 
more projections are required for detecting targets of unknown strengths with- 
out the benefit of spatial homogeneity. We are interested in investigating how 
several such systems can be used in parallel to reliably detect spectral targets 
and anomalies from different coded projections. 
In general, we consider a broadly applicable framework that allows us to
account for background and sensor noise, and perform target detection directly 
from projection measurements of signals obtained at different spatial or tempo- 
ral locations. The precise problem formulation is provided below. 

1.1 Problem formulation 

Let us assume access to a dictionary of possible targets of interest
$\mathcal{D} = \{f^{(1)}, f^{(2)}, \ldots, f^{(m)}\}$, where $f^{(j)} \in \mathbb{R}^N$
for $j = 1, \ldots, m$ is unit-norm. Our measurements are of the form

$$z_i = \Phi(\alpha_i f_i^* + b_i) + w_i \qquad (1)$$

where

• $i \in \{1, \ldots, M\}$ indexes the spatial or temporal locations at which data are
collected;

• $\alpha_i > 0$ is a measure of the signal-to-noise ratio at location $i$, which is
either known or estimated from observations;

• $\Phi \in \mathbb{R}^{K \times N}$ for $K < N$, is a measurement matrix to be specified in Sec. 2;

• $b_i \in \mathbb{R}^N \sim \mathcal{N}(\mu_b, \Sigma_b)$ is the background noise vector, and
$w_i \in \mathbb{R}^K \sim \mathcal{N}(0, \sigma^2 I)$ is the i.i.d. sensor noise.
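
The observation model (1) can be simulated in a few lines. The following is a minimal sketch, assuming a zero-mean background, a random Gaussian measurement matrix, and arbitrary illustrative values for the dimensions, signal strengths and noise levels; the function name `simulate_observations` is hypothetical and not part of the paper.

```python
import numpy as np

def simulate_observations(Phi, targets, alphas, Sigma_b, sigma, rng):
    """Draw z_i = Phi (alpha_i f_i + b_i) + w_i for every location i (mu_b = 0)."""
    K, N = Phi.shape
    M = targets.shape[0]
    # Background b_i ~ N(0, Sigma_b) is projected together with the signal.
    b = rng.multivariate_normal(np.zeros(N), Sigma_b, size=M)
    # Sensor noise w_i ~ N(0, sigma^2 I) is added after the projection.
    w = sigma * rng.standard_normal((M, K))
    return (alphas[:, None] * targets + b) @ Phi.T + w

# Illustrative (assumed) problem sizes and statistics.
rng = np.random.default_rng(0)
N, K, M, m = 50, 10, 200, 4
D = rng.random((m, N))
D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-norm dictionary
labels = rng.integers(0, m, size=M)              # true target index per location
Phi = rng.standard_normal((K, N)) / np.sqrt(K)
Sigma_b = 0.01 * np.eye(N)
alphas = np.full(M, 5.0)
Z = simulate_observations(Phi, D[labels], alphas, Sigma_b, 0.1, rng)
```

Each row of `Z` is one K-dimensional observation; note that the background is multiplied by the measurement matrix while the sensor noise is not, which is the distinction the model relies on.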

For example, in the case of spectral imaging $f_i^*$ represents the spectrum at the
$i$-th spatial location, and in video sequences $f_i^*$ represents the vectorized image
frame obtained at the $i$-th time interval. In this paper we consider the following
target detection problems: 

1. Dictionary signal detection (DSD): Here we assume that each $f_i^* \in \mathcal{D}$
for $i \in \{1, \ldots, M\}$, and our task is to detect all instances of one target
signal $f^{(j)} \in \mathcal{D}$ for some unknown $j \in \{1, \ldots, m\}$, i.e., to locate
$S = \{i : f_i^* = f^{(j)}\}$. DSD is useful in contexts in which we know the
makeup of a scene and wish to focus our attention on the locations of a 
particular signal. For instance, in spectral imaging, DSD is used to study 
a scene of interest by classifying every spectrum in the scene to different 
known classes [36,38]. In a video setup, DSD could be used to classify 
video segments to one of several categories (such as news, weather, sports, 
etc.) by projecting the video sequence to an appropriate feature space and 
comparing the feature vectors to the ones in a known dictionary [55].

2. Anomalous signal detection (ASD): Here, our task is to detect all signals 
which are not members of our dictionary, i.e., detect $S = \{i : f_i^* \notin \mathcal{D}\}$.
(This is akin to anomaly detection methods in the literature which are 
based on nominal, nonanomalous training samples [23,46].) For instance, 
ASD may be used when we know most components of a spectral image 
and wish to identify all spectra which deviate from this model [45]. 
Our goal is to accurately perform DSD or ASD without reconstructing the
spectral input $f_i^*$ from $z_i$ for $i \in \{1, \ldots, M\}$. Accounting for background is a
crucial issue. Typically, the background corresponding to the scene of interest
and the sensor noise are modeled together by a colored multivariate Gaussian
distribution [37]. However, in our case, it is important to distinguish the two
because of the presence of the projection operator $\Phi$. The projection operator acts
upon the background spectrum in the same way as on the target spectrum, but
it does not affect the sensor noise. We assume that $b_i$ and $w_i$ are independent
of each other, and the prior probabilities of different targets in the dictionary
$p^{(j)} = P(f_i^* = f^{(j)})$ for $j \in \{1, \ldots, m\}$ are known in advance. If these
probabilities are unknown, then the targets can be considered equally likely. Given this
setup, our goal is to develop suitable target and anomaly detection approaches,
and provide theoretical guarantees on their performance.

In this paper we develop detection performance bounds which show how 
performance scales with the number of detectors in a compressive setting as a 
function of SNR, the similarity between potential targets in a known dictionary, 
and their prior probabilities. Our bounds are based on a detection strategy 
which operates directly on the collected data as opposed to first reconstructing 
each $f_i^*$ and then performing detection on the estimated signals. Reconstruction
as an intermediate step in detection may be appealing to end users who wish 
to visually inspect spectral images instead of relying entirely on an automatic 
detection algorithm. However, using this intermediate step has two potential 
pitfalls. First, the Rao-Blackwell theorem [6] tells us that an optimal detection
algorithm operating on the processed data (i.e., not sufficient statistics)
cannot perform better than an optimal detection algorithm operating on the 
raw data. In other words, optimal performance is possible on the raw data, 
but we have no such performance guarantee for the reconstructed signals. Sec- 
ond, the relationship between reconstruction errors and detection performance 
is not well understood in many settings. Although we do not reconstruct the 
underlying signals, our performance bounds are intimately related to the signal 
resolution needed to achieve the signal diversity present in our dictionary. Since 
we have many fewer observations than the signals at this resolution, we adopt 
the "compressive" terminology. 

1.2 Performance metric 

To assess the performance of our detection strategies, we consider the False 
Discovery Rate (FDR) metric and related quantities developed for multiple hy- 
pothesis testing problems [5]. Since we collect M independent observations of 
potentially different signals, we are simultaneously conducting M hypothesis 
tests when we search for targets. Unlike the probability of false alarm, which 
measures the probability of falsely declaring a target for a single test, the FDR
measures the fraction of declared targets that are false alarms, that is, it pro- 
vides information about the entire set of M hypotheses instead of just one. More 
formally, the FDR is given by

$$\mathrm{FDR} = \mathbb{E}\left[\frac{V}{R}\right],$$

where V is the number of falsely rejected null hypotheses, and R is the total 
number of rejected null hypotheses. Controlling the false discovery rate in a 
multiple hypothesis testing framework is akin to designing a constant false alarm 
rate (CFAR) detector in spectral target detection applications that keeps the 
false alarm rate at a desired level irrespective of the background interference 
and sensor noise statistics [36]. 
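
As a concrete illustration of the quantity inside the expectation, the sketch below computes the realized ratio V/R for a single set of M decisions. The function name and the toy decision vectors are assumptions made for illustration, and the ratio is taken to be zero when nothing is rejected (the usual convention for the FDR).

```python
import numpy as np

def false_discovery_proportion(rejected, null_true):
    """Return V / R for one set of decisions, with V / R = 0 when R = 0."""
    rejected = np.asarray(rejected, dtype=bool)
    null_true = np.asarray(null_true, dtype=bool)
    R = int(rejected.sum())                  # total number of rejected nulls
    V = int((rejected & null_true).sum())    # rejected although the null is true
    return V / R if R > 0 else 0.0

# Toy example with M = 5 tests (values are made up for illustration).
rejected = [True, True, False, True, False]
null_true = [False, True, True, False, False]
fdp = false_discovery_proportion(rejected, null_true)
```

Averaging this proportion over many independent realizations of the data gives a Monte Carlo estimate of the FDR.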



1.3 Previous investigations 

Much of the classical target detection literature [24, 27, 29, 34, 43] assumes that
each target lies in a $P$-dimensional subspace of $\mathbb{R}^N$ for $P < N$. The subspace in
which the target lies is often assumed to be known or specified by the user, and 
the variability of the background is modeled using a probability distribution. 
Given knowledge of the target subspace, background statistics and sensor noise 
statistics, detection methods based on LRTs (likelihood ratio tests) and GLRTs 
(generalized likelihood ratio tests) have been proposed in [24,27,29,34,43,44]. 
A subspace model is optimal if the subspace in which targets lie is known in 
advance. However, in many applications, such subspaces might be hard to
characterize. An alternative, and a more flexible option is to assume that the 
high-dimensional target exhibits some low-dimensional structure that can be 
exploited to perform efficient target detection. This approach is utilized in this 
work and in [21] where the target signal in $\mathbb{R}^N$ is assumed to come from a
dictionary of $m$ known signals such that $m \ll N$, and in [12], where the targets
are assumed to lie in a low-dimensional manifold embedded in high-dimensional
target space. 

Recently, several methods for target or anomaly detection that rely on re- 
covering the full spatiospectral data from projection measurements [41,56] have 
been proposed. However, they are computationally intensive and the detec- 
tion performance associated with these reconstructions is unknown. Other re- 
searchers have exploited compressive sensing to perform target detection and 
classification without reconstructing the underlying signal [12,16,21]. In [16], 
the authors propose a matching pursuit based algorithm, called the Incoherent 
Detection and Estimation Algorithm (IDEA), to detect the presence of a signal 
of interest against a strong interfering signal from noisy projection measure- 
ments. The algorithm is shown to perform well on experimental data sets under 
some strong assumptions on the sparsity of the signal of interest and the inter- 
fering signal. In [12], the authors develop a classification algorithm called the 
smashed filter to classify an image into one of $m$ known classes from $K$
projections of the signal, where $K < N$. The underlying image is assumed to lie on
a low-dimensional manifold, and the algorithm finds the closest match from the 
m known classes by performing a nearest neighbor search over the m different 
manifolds. The projection measurements are chosen to preserve the distances 
among the manifolds. Though [12] offers theoretical bounds on the number of
measurements necessary to preserve distances among different manifolds, it is 
not clear how the performance scales with K or how to incorporate background 
models into this setup. Moreover, this approach may be computationally inten- 
sive since it involves learning and searching over different manifolds. In [21], the 
authors use a nearest-neighbor classifier to classify an $N$-dimensional signal to
one of m equally likely target classes based on K < N random projections, and 
provide theoretical guarantees on the detector performance. While the method 
discussed in [21] is computationally efficient, it is nontrivial to extend to the case 
of target detection with colored background noise and nonequiprobable targets. 
Furthermore, their performance guarantees cannot be directly extended to our 
problem since we focus on error measures that let us analyze the performance of 
multiple hypothesis tests simultaneously as opposed to the above methods that 
consider compressive classification performance for a single hypothesis test. 

The authors of a more recent work [17] extend the classical RX anomaly 
detector [42] to directly detect anomalies from random, orthonormal projection 
measurements without an intermediate reconstruction step. They numerically 
show how the detection probability improves as a function of the signal-to-noise
ratio when the number of measurements changes. Though probability of
detection is a good performance measure, in many applications controlling the 
false discoveries below a desired level is more crucial. As a result, in our work, 
we propose an anomaly detection method that controls the false discovery rate 
below a desired level. 

1.4 Contributions 

This paper makes the following contributions to the above literature: 

• A compressive target detection approach, which (a) is computationally
efficient, (b) allows for the signal strengths of the targets to vary with
spatial location, (c) allows for backgrounds mixed with potential targets, 
(d) considers targets with different a priori probabilities, and (e) yields 
theoretical guarantees on detector performance. This paper unifies pre- 
liminary work by the authors [30,31], presents previously unpublished 
aspects of the proofs, and contains updated experimental results. 

• A computationally efficient anomaly detection method that detects 
anomalies of different strengths from projection measurements and also 
controls the false discovery rate at a desired level. 

• A whitening filter approach to compressive measurements of signals with 
background contamination, and associated analysis leading to bounds on 
the amount of background to which our detection procedure is robust. 

The above theoretical results, which are the main focus of this paper, are sup- 
ported with simulation studies in Sec. 6. Classical detection methods described 
in [7, 19, 24, 27, 29, 33, 34, 36, 37, 42-45, 49] do not establish performance bounds 
as a function of signal resolution or target dictionary properties and rely on
relatively direct observation models which we show to be suboptimal when the
detector size is limited. The methods in [12] and [16] do not contain performance
analysis, and our analysis builds upon the analysis in [21] to account for
several specific aspects of the compressive target detection problem. 



2 Whitening compressive observations 

Before we present our detection methods for DSD and ASD problems respectively,
we briefly discuss a whitening step that is common to both our problems
of interest.

Let us suppose that there are enough background training data available to
estimate the background mean $\mu_b$ and covariance matrix $\Sigma_b$. We can assume
without loss of generality that $\mu_b = 0$ since $\Phi\mu_b$ can be subtracted from $y$.
Given the knowledge of the background statistics, we can transform the
background and sensor noise model $\Phi b_i + w_i \sim \mathcal{N}(0, \Phi\Sigma_b\Phi^T + \sigma^2 I)$ discussed in
(1) to a simple white Gaussian noise model by multiplying the observations
$z_i$, $i \in \{1, \ldots, M\}$, by the whitening filter $C_\Phi = (\Phi\Sigma_b\Phi^T + \sigma^2 I)^{-1/2}$. This
whitening transformation reduces the observation model in (1) to

$$y_i = C_\Phi\left(\Phi(\alpha_i f_i^* + b_i) + w_i\right) = \alpha_i A f_i^* + n_i \qquad (2)$$

where

$$A = C_\Phi \Phi, \qquad (3)$$

and $n_i = C_\Phi(\Phi b_i + w_i) \sim \mathcal{N}(0, I)$. To verify that $n_i \sim \mathcal{N}(0, I)$, observe that

$$n_i = C_\Phi(\Phi b_i + w_i) \sim \mathcal{N}\Big(0, \underbrace{C_\Phi\left(\Phi\Sigma_b\Phi^T + \sigma^2 I\right)C_\Phi^T}_{I}\Big).$$

We can now choose $\Phi$ so that the corresponding $A$ has certain desirable
properties as detailed in Sec. 3 and Sec. 5.
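
The whitening step can be checked numerically. The sketch below is a minimal illustration, assuming a randomly generated measurement matrix and background covariance; the helper name `whitening_filter` is hypothetical. It forms the inverse matrix square root through an eigendecomposition and confirms that the whitened noise covariance is the identity.

```python
import numpy as np

def whitening_filter(Phi, Sigma_b, sigma):
    """Return C_Phi = (Phi Sigma_b Phi^T + sigma^2 I)^(-1/2)."""
    K = Phi.shape[0]
    cov = Phi @ Sigma_b @ Phi.T + sigma**2 * np.eye(K)
    # cov is symmetric positive definite, so eigh gives cov = V diag(evals) V^T.
    evals, evecs = np.linalg.eigh(cov)
    return (evecs / np.sqrt(evals)) @ evecs.T

# Assumed toy sizes and statistics.
rng = np.random.default_rng(1)
K, N, sigma = 5, 20, 0.5
Phi = rng.standard_normal((K, N))
L = rng.standard_normal((N, N))
Sigma_b = L @ L.T / N                        # a valid (PSD) background covariance

C = whitening_filter(Phi, Sigma_b, sigma)
# Covariance of n_i = C (Phi b_i + w_i); it should equal the identity.
whitened_cov = C @ (Phi @ Sigma_b @ Phi.T + sigma**2 * np.eye(K)) @ C.T
A = C @ Phi                                  # effective sensing matrix, as in (3)
```

Here `whitened_cov` agrees with the K x K identity up to floating-point error, which is the property the whitened model relies on.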

For a given $A$, the following theorem provides a construction of $\Phi$ that
satisfies (3) and a bound on the maximum tolerable background contamination:

Theorem 1. Let $B = I - A\Sigma_b A^T$. If the largest eigenvalue of $\Sigma_b$ satisfies

$$\lambda_{\max} < \frac{1}{\|A\|^2}, \qquad (4)$$

where $\|A\|$ is the spectral norm of $A$, then $B$ is positive definite and
$\Phi = \sigma B^{-1/2} A$ is a sensing matrix, which can be used in conjunction with a whitening
filter to produce observations modeled in (2).
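
The relationship between a desired $A$ and the physical sensing matrix can be verified numerically. The following is a sketch under an explicit assumption: it uses the construction $\Phi = \sigma B^{-1/2} A$ with $B = I - A\Sigma_b A^T$, which satisfies (3) whenever $B$ is positive definite (substituting it into the whitening filter gives $C_\Phi = \sigma^{-1} B^{1/2}$, hence $C_\Phi \Phi = A$). The function names and toy values are illustrative only.

```python
import numpy as np

def inv_sqrt_spd(S):
    """Inverse square root of a symmetric positive definite matrix."""
    evals, evecs = np.linalg.eigh(S)
    return (evecs / np.sqrt(evals)) @ evecs.T

def sensing_matrix(A, Sigma_b, sigma):
    """Assumed construction Phi = sigma * B^(-1/2) A with B = I - A Sigma_b A^T."""
    K = A.shape[0]
    B = np.eye(K) - A @ Sigma_b @ A.T
    if np.linalg.eigvalsh(B).min() <= 0:
        raise ValueError("B is not positive definite: background is too strong")
    return sigma * inv_sqrt_spd(B) @ A

rng = np.random.default_rng(2)
K, N, sigma = 5, 20, 0.5
A = rng.standard_normal((K, N)) / np.sqrt(K)
# Background whose largest eigenvalue is half of 1 / ||A||^2 (spectral norm).
Sigma_b = 0.5 / np.linalg.norm(A, 2) ** 2 * np.eye(N)

Phi = sensing_matrix(A, Sigma_b, sigma)
C = inv_sqrt_spd(Phi @ Sigma_b @ Phi.T + sigma**2 * np.eye(K))   # whitening filter
```

With these values `C @ Phi` reproduces `A` up to floating-point error; increasing the scale of `Sigma_b` beyond the stated limit makes `B` lose positive definiteness and the construction fail.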






The proof of this theorem is provided in Appendix A. This theorem draws 
an interesting relationship between the maximum background perturbation that 
the system can tolerate and the spectral norm of the measurement matrix, which 
in turn varies with $K$ and $N$. Hardware designs such as those in [14, 50] use
spatial light modulators and digital micromirrors, which allow the measurement
matrix $\Phi$ to be adjusted easily in response to changing background statistics
and other operating conditions. 

In the sections that follow, we consider collecting measurements of the form 
$y_i = \alpha_i A f_i^* + n_i$ given in (2), where $f_i^*$ is the target of interest for $i = 1, \ldots, M$,
and $A \in \mathbb{R}^{K \times N}$ is a sensing matrix that satisfies (3). It is assumed that any
background contamination has been eliminated with the whitening procedure 
described in this section. 

3 Dictionary signal detection 

Suppose that the end user wants to test for the presence of one known target
versus the rest, but it is not known a priori which target from $\mathcal{D}$ the user wants
to detect. In this case, let us cast the DSD problem as a multiple hypothesis
testing problem of the form

$$\mathcal{H}_{0i}^{(j)} : f_i^* = f^{(j)} \quad \text{vs.} \quad \mathcal{H}_{1i}^{(j)} : f_i^* \neq f^{(j)} \qquad (5)$$

where $f^{(j)} \in \mathcal{D}$ is the target of interest and $i = 1, \ldots, M$.

3.1 Decision rule

We define our decision rule corresponding to target $f^{(j)} \in \mathcal{D}$ in terms of a set
of significance regions $\Gamma_i^{(j)}$ such that one rejects the $i$-th null hypothesis if its
test statistic $y_i$ falls in the $i$-th significance region. Specifically, $\Gamma_i^{(j)}$ is defined
according to

$$\Gamma_i^{(j)} = \left\{y : \log P(f_i^* = f^{(j)} \mid y_i, \alpha_i, A) < \log P(f_i^* = f^{(\ell)} \mid y_i, \alpha_i, A) \text{ for some } \ell \in \{1, \ldots, m\},\ \ell \neq j\right\}, \qquad (6)$$

where $\log P(f_i^* = f^{(j)} \mid y_i, \alpha_i, A) = \frac{K}{2}\log\left(\frac{1}{2\pi}\right) - \frac{\|y_i - \alpha_i A f^{(j)}\|^2}{2} + \log p^{(j)}$ is the
logarithm of the a posteriori probability density of the target $f^{(j)}$ at the $i$-th
spatial location given the observations $y_i$, the signal-to-noise ratio $\alpha_i$ and the
sensing matrix $A$, and $p^{(j)}$ is the a priori probability of target class $j$. Note
that the process of determining these decision regions involves a sequence of 
nearest-neighbor calculations, so the computational complexity scales with the 
number of classes m. In this work, we operate under the assumption that m is 
much smaller than the dimensionality of the datasets we consider. For example, 
if we consider spectral images, then the number of objects (signal classes) that 
make up a scene of interest is often smaller than the number of voxels in the 
image. This assumption is not unrealistic and has been exploited in earlier
work such as [36] and the references therein. In most of the prior work we have 
surveyed [10,11], the number of signal classes is less than 35, which doesn't 
make our approach intractable. 
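
The decision rule amounts to a prior-weighted nearest-neighbor search over the projected dictionary. The sketch below is one possible implementation under the whitened model (2); the function names are hypothetical, the dictionary, priors and signal strength are made-up toy values, and constants common to all classes are dropped from the score.

```python
import numpy as np

def log_posterior_scores(y, alpha, A, D, priors):
    """Score of each dictionary element: -||y - alpha A f||^2 / 2 + log prior.

    The term (K/2) log(1/(2 pi)) is identical for every class and is omitted.
    """
    residuals = y[None, :] - alpha * (D @ A.T)   # row l holds y - alpha A f^(l)
    return -0.5 * np.sum(residuals**2, axis=1) + np.log(priors)

def reject_null(y, alpha, A, D, priors, j):
    """Reject H0: f* = f^(j) if some other dictionary element scores strictly higher."""
    scores = log_posterior_scores(y, alpha, A, D, priors)
    return bool(np.any(np.delete(scores, j) > scores[j]))

# Toy setup (assumed values).
rng = np.random.default_rng(3)
N, K, m = 40, 8, 3
D = rng.random((m, N))
D /= np.linalg.norm(D, axis=1, keepdims=True)
A = rng.standard_normal((K, N)) / np.sqrt(K)
priors = np.full(m, 1.0 / m)
alpha = 10.0
y = alpha * (A @ D[2])                           # noiseless observation of f^(2)
```

For this noiseless observation the null hypothesis is retained for `j = 2` and rejected for the other classes. The cost per observation is m projected-distance evaluations, which matches the remark that complexity scales with the number of classes.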

The decision rule can be formally expressed in terms of the significance 
regions as follows: 

reject $\mathcal{H}_{0i}^{(j)}$ if the test statistic $y_i \in \Gamma_i^{(j)}$. (7)

We analyze this detector by extending the positive False Discovery Rate 
(pFDR) error measure introduced by Storey to characterize the errors encoun- 
tered in performing multiple, independent and nonidentical hypothesis tests 
simultaneously [48]. The pFDR, discussed formally below, is the fraction of 
falsely rejected null hypotheses among the total number of rejected null hy- 
potheses, subject to the positivity condition that one rejects at least one null 
hypothesis. The pFDR is similar to the FDR except that the positivity con- 
dition is enforced here. In our context, the positivity condition means that we 
declare at least one signal to be a nontarget, which in turn implies that the 
scene of interest is composed of more than one object in the case of spectral 
imaging, or that the scene is not static in the case of video imaging. 
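
The effect of the positivity condition can be seen in a small empirical estimate. The following sketch, with a hypothetical function name and made-up decision arrays, averages the ratio of false rejections to total rejections only over trials in which at least one null hypothesis is rejected.

```python
import numpy as np

def empirical_pfdr(rejected, null_true):
    """Average V / R over trials with R > 0.

    rejected, null_true: boolean arrays of shape (trials, M).
    """
    rejected = np.asarray(rejected, dtype=bool)
    null_true = np.asarray(null_true, dtype=bool)
    R = rejected.sum(axis=1)
    V = (rejected & null_true).sum(axis=1)
    keep = R > 0                                 # positivity condition
    if not keep.any():
        return float("nan")                      # undefined if nothing is ever rejected
    return float(np.mean(V[keep] / R[keep]))

# Three toy trials with M = 4 tests each (values are made up).
rejected = np.array([[1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 1, 1]], dtype=bool)
null_true = np.array([[1, 0, 0, 1], [1, 1, 0, 0], [0, 0, 0, 1]], dtype=bool)
pfdr_hat = empirical_pfdr(rejected, null_true)
```

The second trial rejects nothing and is excluded, so the estimate is the mean of 1/2 and 1/3; an FDR estimate would instead count that trial as a zero.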

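Consider a collection of significance regions $\Gamma = \{\Gamma_i^{(j)} : i = 1, \ldots, M\}$, such
that one declares $\mathcal{H}_{1i}^{(j)}$ if the test statistic $y_i \in \Gamma_i^{(j)}$. The pFDR for multiple,
nonidentical hypothesis tests can be defined in terms of the significance regions
as follows:

$$\mathrm{pFDR}^{(j)}(\Gamma) = \mathbb{E}\left[\left.\frac{V(\Gamma)}{R(\Gamma)} \,\right|\, R(\Gamma) > 0\right] \qquad (8)$$

where

$$V(\Gamma) = \sum_{i=1}^{M} \mathbb{I}_{\{y_i \in \Gamma_i^{(j)}\}}\, \mathbb{I}_{\{\mathcal{H}_{0i}^{(j)}\}} \qquad (9)$$

is the number of falsely rejected null hypotheses,

$$R(\Gamma) = \sum_{i=1}^{M} \mathbb{I}_{\{y_i \in \Gamma_i^{(j)}\}} \qquad (10)$$

is the total number of rejected null hypotheses, and $\mathbb{I}_{\{E\}} = 1$ if event $E$ is
true and 0 otherwise. In our setup, the pFDR corresponds to the expected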
ratio of the number of missed targets to the number of signals declared to be 
nontargets subject to the condition that at least one signal is declared to be a 
nontarget. (Note that this ratio is traditionally referred to as the positive false 
nondiscovery rate (pFNR), but is technically the pFDR in this context because 
of our definitions of the null and alternate hypotheses.) The theorem below 
presents our main result: 
Theorem 2. Given observations of the form (2), if one performs multiple,
independent, nonidentical hypothesis tests of the form (5) and decides
according to (7), then the worst-case pFDR given by
$\mathrm{pFDR}_{\max} = \max_{j \in \{1, \ldots, m\}} \mathrm{pFDR}^{(j)}(\Gamma)$, satisfies the following bound:

$$\mathrm{pFDR}_{\max} \leq \min\left(1, \frac{(P_e)_{\max}}{1 - p_{\max} - (P_e)_{\max}}\right) \qquad (11)$$

where

$$p_{\max} = \max_{j \in \{1, \ldots, m\}} p^{(j)}, \qquad (P_e)_{\max} = \max_{i \in \{1, \ldots, M\}} P\left(\hat{f}_i \neq f_i^*\right), \quad \text{and}$$

$$\hat{f}_i = \arg\max_{f \in \mathcal{D}} P(f_i^* = f \mid y_i, \alpha_i, A). \qquad (12)$$

The proof of this theorem is detailed in Appendix B. A key element of our 
proof is the adaptation of the techniques from [48] to nonidentical independent 
hypothesis tests. 

3.2 An achievable bound on the worst-case pFDR 

Theorem 2 in the preceding section shows that, for a given A, the worst-case 
pFDR is bounded from above by a function of the worst-case misclassification 
probability. In this section, we use this theorem to establish an achievable bound
on the worst-case pFDR that explicitly depends on the number of observations
$K$, signal strengths $\{\alpha_i\}_{i=1}^{M}$, similarity among different targets of interest, and
a priori target probabilities. 

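Let us first define the quantities

$$d_{\min} = \min_{j, \ell \in \{1, \ldots, m\},\ j \neq \ell} \|f^{(j)} - f^{(\ell)}\|,$$

$$p_{\min} = \min_{j \in \{1, \ldots, m\}} p^{(j)},$$

$$\alpha_{\min} = \min_{i \in \{1, \ldots, M\}} \alpha_i.$$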

Then we have the following theorem, whose proof is given in Appendix C: 

Theorem 3. Let $\lambda_{\max}$ denote the largest eigenvalue of $\Sigma_b$. For a given
$0 < \epsilon < 1 - p_{\max}$, assume that $K$ and $N$ are sufficiently large so that the following
conditions hold:

'-^---'--p—V + ^K^) +'""^1 2 ) ^'''^ 

Amax < 2' (13b) 



(13c) 

log(l + ^) 

Then there exists a $K \times N$ sensing matrix $A$ that satisfies the condition of
Theorem 1, and for which 

pFDR^,, + <£|nn\ ' \ + 



Pmin \ 1 J^min \ 47^ 

fc|^expf-(^±^V ' (14) 



This result has the following implications and consequences: 

1. For a given $N$, the upper bound (13b) on $\lambda_{\max}$ increases with $K$,
which implies that the system can tolerate more background perturbation
if we collect more measurements.

2. The pFDR bound (14) decays with the increase in the values of $K$, $d_{\min}$
and $\alpha_{\min}$, and increases as $p_{\min}$ decreases. For a fixed $p_{\max}$, $p_{\min}$, $\alpha_{\min}$ and
$d_{\min}$, the bound in (14) enables one to choose a value of $K$ to guarantee
a desired pFDR value.

3. The dominant part of the bound (14) is independent of $N$, and is only a
function of $K$, $p_{\max}$, $p_{\min}$, $\alpha_{\min}$, and $d_{\min}$. The lack of dependence on $N$
is not unexpected. Indeed, when we are interested in preserving pairwise
distances among the members of a fixed dictionary of size $m$, the
Johnson-Lindenstrauss lemma [25] says that, with high probability, $K = O(\log m)$
random Gaussian projections suffice, regardless of the ambient dimension
$N$. This is precisely the regime we are working with here.

4. The bound on $K$ given in (13c) increases logarithmically with the increase
in the difference between $p_{\max}$ and $p_{\min}$. This is to be expected since
one would need more measurements to detect a less probable target as
our decision rule weights each target by its a priori probability. If all
targets are equally likely, then $p_{\max} = p_{\min} = 1/m$ and $K = O(\log m)$ is
sufficient provided $\alpha_{\min}^2 d_{\min}^2$ is sufficiently large such that

log ( 1 + '"^y"" j > log f 1 + J > 1 






(where the first inequality holds since K < N). In addition, the lower 
bound on K also illustrates the interplay between the signal strength of 
the targets, the similarity among different targets in $\mathcal{D}$, and the number
of measurements collected. A small value of $d_{\min}$ suggests that the targets
in $\mathcal{D}$ are very similar to each other, and thus $\alpha_{\min}$ and $K$ need to be high
enough so that similar targets can still be distinguished. The experimental 
results discussed in Sec. 6 illustrate the tightness of the theoretical results 
discussed here. 
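
The distance-preservation behavior referred to in item 3 can be observed numerically. The sketch below is an illustration under assumed values (dictionary size, number of projections, and two ambient dimensions); it projects a small random dictionary with a Gaussian matrix and reports how much each pairwise distance is stretched or shrunk.

```python
import numpy as np

def pairwise_distance_ratios(D, A):
    """Ratios ||A (f_j - f_l)|| / ||f_j - f_l|| over all pairs j < l."""
    m = D.shape[0]
    j, l = np.triu_indices(m, k=1)
    diffs = D[j] - D[l]
    return np.linalg.norm(diffs @ A.T, axis=1) / np.linalg.norm(diffs, axis=1)

rng = np.random.default_rng(4)
m, K = 10, 200                                   # assumed dictionary size and projections
ratios_by_N = {}
for N in (500, 5000):
    D = rng.random((m, N))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    A = rng.standard_normal((K, N)) / np.sqrt(K) # i.i.d. N(0, 1/K) entries
    ratios_by_N[N] = pairwise_distance_ratios(D, A)
```

For both ambient dimensions the ratios concentrate around one with a spread governed by K rather than N, which is the qualitative behavior the Johnson-Lindenstrauss argument predicts.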

Inspection of the proof shows that if A is generated according to a Gaussian 
distribution, then the conditions of Theorem 3 will be met with high probability. 

4 Extension to a manifold-based target detection framework

The DSD problem formulation in Sec. 1.1 is accurate if the signals in the dictio- 
nary are faithful representations of the target signals that we observe. In reality, 
however, the target signals will differ from the dictionary signals owing to the 
differences in the experimental conditions under which they are collected. For 
instance, in spectral imaging applications, the observed spectrum of any ma- 
terial will not match the reference spectrum of the same material observed in 
a laboratory because of the differences in atmospheric and illumination condi- 
tions. To overcome this problem, one could form a large dictionary to account 
for such uncertainties in the target signals and perform target detection accord- 
ing to the approaches discussed in Sec. 2 and Sec. 3. A potential drawback with 
this approach is that our theoretical performance bound increases with the size 
of $\mathcal{D}$ through $p_{\min}$ and $d_{\min}$. Instead, one could reasonably model the target
signals observed under different experimental conditions to lie in a low-dimensional
submanifold of the high-dimensional ambient signal space as shown to be true 
for spectral images in [22]. We can exploit this result to extend our analysis to 
a much broader framework that accounts for uncertainties in our dictionary. 

Let us consider a dictionary of manifolds $\mathcal{D}_{\mathcal{M}} = \{\mathcal{M}^{(1)}, \ldots, \mathcal{M}^{(m)}\}$
corresponding to $m$ different target classes, and that $f_i^*$ for $i \in \{1, \ldots, M\}$ is in one
of the manifolds in $\mathcal{D}_{\mathcal{M}}$. Considering an observation model of the form given
in (2), our goal is to determine $\{i : f_i^* \in \mathcal{M}^{(j)}\}$, where $j \in \{1, \ldots, m\}$ is the
target class of interest. Let us assume that all target classes are equally likely
to keep the presentation simple, though the analysis extends to the case where
the target classes have different a priori probabilities. Suppose that we collect
independent sets of measurements $\{y_i\}_{i=1}^{M}$ and $\{\tilde{y}_i\}_{i=1}^{M}$. Then, we can use the
following two-step procedure to extend our DSD method to this manifold-based 
framework: 



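1. Given $\{\tilde{y}_i\}$, form a data-dependent dictionary
$\mathcal{D}_{\tilde{y}_i} = \{\hat{f}_i^{(1)}, \ldots, \hat{f}_i^{(m)}\}$
corresponding to each $\tilde{y}_i$ by finding its nearest-neighbor in each manifold:

$$\hat{f}_i^{(\ell)} = \arg\max_{f \in \mathcal{M}^{(\ell)}} P(\tilde{y}_i \mid f_i^* = f, \alpha_i, A)$$

for $\ell \in \{1, \ldots, m\}$ and $i = 1, \ldots, M$.

2. Given $\{y_i\}$ and corresponding $\{\mathcal{D}_{\tilde{y}_i}\}$, find

$$\hat{f}_i = \arg\max_{f \in \mathcal{D}_{\tilde{y}_i}} P(y_i \mid f_i^* = f, \alpha_i, A)$$

and declare that the observed spectrum corresponds to class $j$ if
$\hat{f}_i = \hat{f}_i^{(j)}$.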
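
The two-step procedure can be sketched in code under a simplifying assumption: each manifold is represented by a finite set of sampled points, so the search over a manifold becomes a search over its samples. Under white Gaussian noise and equal priors, maximizing the likelihood is the same as minimizing the projected distance. All names, the synthetic one-parameter manifolds, and the numerical values below are illustrative assumptions.

```python
import numpy as np

def nearest_on_manifold(y, alpha, A, samples):
    """Sampled manifold point f whose projection alpha * A f is closest to y."""
    dists = np.linalg.norm(y[None, :] - alpha * (samples @ A.T), axis=1)
    return samples[np.argmin(dists)]

def two_step_classify(y_tilde, y, alpha, A, manifolds):
    # Step 1: build the data-dependent dictionary from the first measurement set.
    D_i = np.stack([nearest_on_manifold(y_tilde, alpha, A, S) for S in manifolds])
    # Step 2: nearest-neighbor decision (equal priors) using the second, independent set.
    dists = np.linalg.norm(y[None, :] - alpha * (D_i @ A.T), axis=1)
    return int(np.argmin(dists)), D_i

rng = np.random.default_rng(5)
N, K, m, alpha = 60, 12, 3, 20.0
manifolds = []
for _ in range(m):
    base, direction = rng.random(N), rng.random(N)
    # A one-parameter family of unit-norm signals standing in for a manifold.
    S = base[None, :] + np.linspace(0.0, 0.3, 25)[:, None] * direction[None, :]
    manifolds.append(S / np.linalg.norm(S, axis=1, keepdims=True))

A = rng.standard_normal((K, N)) / np.sqrt(K)
f_true = manifolds[1][7]                              # true signal lies on manifold 1
y_tilde = alpha * (A @ f_true) + rng.standard_normal(K)
y = alpha * (A @ f_true) + rng.standard_normal(K)     # independent second measurement
label, D_i = two_step_classify(y_tilde, y, alpha, A, manifolds)
```

Using two independent measurement sets keeps the dictionary formed in the first step statistically independent of the noise in the second step, which is what allows the dictionary-based analysis to carry over.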

This two-step procedure is studied in [12] for the case $\{\tilde{y}_i\} = \{y_i\}$ where the
authors provide bounds on the number of projection measurements needed to 
preserve distances among manifolds. However, they do not offer associated 
target detection performance guarantees. Our analysis and the theoretical per- 
formance bounds extend directly to this framework if we collect two sets of 
observations as discussed above. Specifically, the hypothesis tests correspond- 
ing to the second step can be written as 

$$\mathcal{H}_{0i}^{(j)} : f_i^* = \hat{f}_i^{(j)} \quad \text{vs.} \quad \mathcal{H}_{1i}^{(j)} : f_i^* \neq \hat{f}_i^{(j)}$$

where $\hat{f}_i^{(j)} \in \mathcal{D}_{\tilde{y}_i}$ for $i = 1, \ldots, M$. Since the dictionary in this case changes
with $i$, these tests are nonidentical. This is another instance where our extension
of pFDR-based analysis towards simultaneous testing of multiple, independent,
and nonidentical hypothesis tests (8) is very significant. Following the proof
techniques discussed in the appendix, we can straightforwardly show that the
bound in (14) in this manifold setting holds with $p_{\min} = p_{\max} = 1/m$ since all
target classes are assumed to be equally likely here, and $d_{\min} = \min_{i \in \{1, \ldots, M\}} d_i$
where

$$d_i = \min_{\hat{f}_i^{(\ell)}, \hat{f}_i^{(k)} \in \mathcal{D}_{\tilde{y}_i},\ \ell \neq k} \|\hat{f}_i^{(\ell)} - \hat{f}_i^{(k)}\|.$$



5 Anomalous signal detection 

The target detection approach discussed above assumes that the target signal 
of interest resides in a dictionary that is available to the user. However, in 
some applications (such as military applications and surveillance), one might be 
interested in detecting objects not in the dictionary. In other words, the target
signals of interest are anomalous and are not available to the user. In this
section we show how the target detection methods discussed above can be ex- 
tended to anomaly detection. In particular, we exploit the distance preservation 
property of the sensing matrix A to detect anomalous targets from projection 
measurements. 



5.1 Problem formulation 

Given observations of the form in (2), we are interested in detecting whether
$f_i^* \in \mathcal{D}$ or $f_i^*$ is anomalous. Let us write the anomaly detection problem as the
following multiple hypothesis test:

$$\mathcal{H}_{0i} : \|f_i^* - f\| \leq \tau \text{ for some } f \in \mathcal{D} \qquad (15a)$$
$$\mathcal{H}_{1i} : \|f_i^* - f\| > \tau \text{ for all } f \in \mathcal{D} \qquad (15b)$$

where $\tau \in [0, \sqrt{2})$ is a user-defined threshold that encapsulates our uncertainty
about the accuracy with which we know the dictionary. In particular, $\tau$ controls
how different a signal needs to be from every dictionary element to truly be
considered anomalous. In the absence of any prior knowledge on the targets of
interest, $\tau$ can simply be set to zero. The null hypothesis in this setting models
the normal behavior, while the alternative hypothesis models the abnormal or 
anomalous behavior. This formulation is consistent with the literature [17,45]. 

Note that the definition of the hypotheses given in (15a) and (15b) matches 
the definition in (5) for the special case where the dictionary contains just one 
signal. In this special case, the signal input /* is in the dictionary under the 
null hypothesis in both DSD and ASD problem formulations. ^ 



5.2 Anomaly detection approach 

Our anomaly detection approach and the associated theoretical analysis are 
based on a "distance preservation" property of A, which is stated formally in 
(18). We propose an anomaly detection method that controls the false discovery
rate (FDR) below a desired level $\delta$ for different background and sensor noise
statistics. In other words, we control the expected ratio of falsely declared
anomalies to the total number of signals declared to be anomalous. Note that
here we work with the FDR as opposed to the pFDR, since it is possible for a
scene to not contain any anomalies at all. We let $V/R = 0$ when $R = 0$, since
one does not declare any signal to be anomalous in this case. In [5], Benjamini
and Hochberg discuss a p-value based procedure, the "BH procedure", that controls
the false discovery rate of $M$ independent hypothesis tests below a desired level.
Let

$$d_i = \min_{f \in \mathcal{D}} \|y_i - \alpha_i A f\| = \min_{f \in \mathcal{D}} \|\alpha_i A (f_i^* - f) + n_i\| \tag{16}$$

be the test statistic at the $i$-th location. The p-value can be defined in terms of
our test statistic as follows:

$$p_i = \mathbb{P}\left(\tilde{d}_i \ge d_i \mid \mathcal{H}_{0i}\right) \tag{17}$$

where $\tilde{d}_i = \min_{f \in \mathcal{D}} \|\alpha_i A (f_i^* - f) + n\|$ and $n \sim \mathcal{N}(0, I)$ is independent of $n_i$.
This is the probability, under the null hypothesis, of acquiring a test statistic at
least as extreme as the one observed. Let us denote the ordered set of p-values
by $p_{(1)} \le p_{(2)} \le \cdots \le p_{(M)}$ and let $\mathcal{H}_{0(i)}$ be the null hypothesis corresponding
to the $i$-th smallest p-value $p_{(i)}$. The BH procedure says that if we reject all $\mathcal{H}_{0(i)}$ for $i = 1, \ldots, t$,
where $t$ is the largest $i$ for which $p_{(i)} \le i\delta/M$, then the FDR is controlled at $\delta$.

^Note that $\tau$ cannot exceed $\sqrt{2}$ because we assume that all targets of interest, including
those in $\mathcal{D}$ and the actual target $f_i^*$, are unit-norm.

^The anomaly detection problem discussed here is more accurately described as target
detection in the classical detection theory vocabulary. However, in recent works [23,46], the
authors assume that the nominal distribution is obtained from training data and a test sample
is declared to be anomalous if it falls outside of the nominal distribution learned from the
training data. Our work is in a similar spirit where we learn our dictionary from training data
and label any test spectrum that does not correspond to our dictionary as being anomalous.
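The step-up rule just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `benjamini_hochberg` and the example p-values are assumptions made here, and the p-values are taken as given.

```python
def benjamini_hochberg(p_values, delta):
    """Return the (sorted) indices of the hypotheses rejected by the BH rule."""
    M = len(p_values)
    # Sort hypothesis indices by p-value, smallest first.
    order = sorted(range(M), key=lambda i: p_values[i])
    # t is the largest 1-based rank with p_(rank) <= rank * delta / M.
    t = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * delta / M:
            t = rank
    # Reject the t hypotheses with the smallest p-values.
    return sorted(order[:t])


# Hypothetical p-values for five locations, FDR level delta = 0.1.
p = [0.001, 0.30, 0.012, 0.04, 0.75]
print(benjamini_hochberg(p, delta=0.1))  # prints [0, 2, 3]
```

Note that the rule is "step-up": every hypothesis ranked at or below $t$ is rejected, even if an individual $p_{(i)}$ with $i < t$ exceeds its own threshold $i\delta/M$.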

To apply this procedure in our setting, we need to find a tractable expression
for the p-value at every location. This can be accomplished when $A$ satisfies
the distance-preservation condition stated below. Let $\mathcal{V} = \mathcal{D} \cup \{f_i^* : i \in
\{1, \ldots, M\}\}$ be the set of all signals in the dictionary and the ones whose projections
are measured. Note that $|\mathcal{V}| \le M + m$. For a given $\epsilon \in (0, 1)$, a projection
operator $A \in \mathbb{R}^{K \times N}$, $K \le N$, is distance-preserving on $\mathcal{V}$ if the following holds
for all $u, v \in \mathcal{V}$:

$$(1 - \epsilon)\|u - v\| \le \|A(u - v)\| \le (1 + \epsilon)\|u - v\|. \tag{18}$$

The existence of such projection operators is guaranteed by the celebrated
Johnson-Lindenstrauss (JL) lemma [25], which says that there exist random
constructions of $A$ for which (18) holds with probability at least $1 - 2|\mathcal{V}|^2 e^{-K c(\epsilon)}$
provided $K = O(\log |\mathcal{V}|) \le N$, where $c(\epsilon)$ is a function of $\epsilon$ given in [1,4]. Examples of
such constructions are: (a) Gaussian matrices whose entries are drawn from
$\mathcal{N}(0, 1/K)$, (b) Bernoulli matrices whose entries are $\pm 1/\sqrt{N}$ with probability
1/2, (c) random matrices whose entries are $\pm\sqrt{3/N}$ with probability 1/6 and
zero with probability 2/3 [1,4], and (d) matrices that satisfy the Restricted
Isometry Property (RIP) where the signs of the entries in each column are
randomized [28].
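Condition (18) can be checked empirically for a particular draw of $A$. The sketch below, written with the Python standard library only, draws a Gaussian matrix as in construction (a) and reports the smallest and largest ratio $\|A(u - v)\| / \|u - v\|$ over a small point set. The helper names, the random unit-norm points standing in for $\mathcal{V}$, and the sizes $N = 106$, $K = 53$ are assumptions chosen for illustration.

```python
import math
import random


def gaussian_matrix(K, N, rng):
    """K x N matrix with i.i.d. N(0, 1/K) entries (construction (a))."""
    s = 1.0 / math.sqrt(K)
    return [[rng.gauss(0.0, s) for _ in range(N)] for _ in range(K)]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def unit(x):
    n = norm(x)
    return [v / n for v in x]


def distortion_range(A, points):
    """Smallest and largest ratio ||A(u - v)|| / ||u - v|| over all pairs."""
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            diff = [a - b for a, b in zip(points[i], points[j])]
            ratios.append(norm(matvec(A, diff)) / norm(diff))
    return min(ratios), max(ratios)


rng = random.Random(0)
N, K = 106, 53
# Hypothetical stand-in for the set V: ten random unit-norm vectors.
points = [unit([rng.gauss(0.0, 1.0) for _ in range(N)]) for _ in range(10)]
lo, hi = distortion_range(gaussian_matrix(K, N, rng), points)
print(f"ratios lie in [{lo:.3f}, {hi:.3f}]")
```

For this draw, (18) holds on the point set with any $\epsilon \ge \max(1 - \text{lo}, \text{hi} - 1)$.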

We now state our main theorem that gives a tight upper bound on the p-value
at every location when $\{\alpha_i\}$ are unknown and are estimated from the
observations. Let $\{\hat{\alpha}_i\}$ be the estimates of $\{\alpha_i\}$ that satisfy

$$1 - \zeta \le \frac{\alpha_i}{\hat{\alpha}_i} \le 1 + \zeta \tag{19}$$

for $i = 1, \ldots, M$, where $\zeta \in [0, 1]$ is a measure of the accuracy of the estimation
procedure.

Theorem 4. If the $i$-th hypothesis test is defined according to (15a) and (15b),
the projection matrix $A$ satisfies (18) for a given $\epsilon \in (0, 1)$, and the estimates
$\{\hat{\alpha}_i\}$ satisfy (19) for some $\zeta \in [0, 1]$, then the bound

$$p_i \le 1 - \mathcal{F}\left(d_i^2;\, K,\, (1 + \epsilon)^2 \hat{\alpha}_i^2 (\zeta + \tau)^2\right) \tag{20}$$

holds for all $i = 1, \ldots, M$, where $\mathcal{F}(\cdot\,; K, \nu)$ is the CDF of a noncentral $\chi^2$
random variable with $K$ degrees of freedom and noncentrality parameter $\nu$ [54].
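The right-hand side of (20) is straightforward to evaluate numerically. The following sketch uses only the Python standard library: it computes the noncentral $\chi^2$ CDF through its standard representation as a Poisson-weighted mixture of central $\chi^2$ CDFs, each obtained from the power series of the regularized incomplete gamma function. The function names and the example values are assumptions made here; the series is adequate for the moderate arguments used in this illustration, but a library routine would be preferable for very large $d_i^2$ or $\nu$.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x), computed by its power series."""
    if x <= 0.0:
        return 0.0
    # First term x^a e^{-x} / Gamma(a + 1), evaluated in log space.
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total = term
    for n in range(1, 100000):
        term *= x / (a + n)
        total += term
        if n > x and term < 1e-17 * total:
            break
    return min(total, 1.0)


def ncx2_cdf(x, K, nu):
    """CDF of a noncentral chi-square with K degrees of freedom, noncentrality nu."""
    if nu == 0.0:
        return reg_lower_gamma(K / 2.0, x / 2.0)
    half = nu / 2.0
    j_max = int(half + 10.0 * math.sqrt(half) + 50)
    total = 0.0
    for j in range(j_max + 1):
        # Poisson(nu / 2) weight of the j-th central chi-square component.
        log_w = -half + j * math.log(half) - math.lgamma(j + 1.0)
        total += math.exp(log_w) * reg_lower_gamma(K / 2.0 + j, x / 2.0)
    return min(total, 1.0)


def p_value_upper_bound(d, K, alpha_hat, eps, zeta, tau):
    """Right-hand side of (20) for test statistic d and estimate alpha_hat."""
    nu = (1.0 + eps) ** 2 * alpha_hat ** 2 * (zeta + tau) ** 2
    return 1.0 - ncx2_cdf(d * d, K, nu)


# Hypothetical values: K = 53 measurements, alpha_hat = 18, eps = tau = 0.1.
print(p_value_upper_bound(d=9.0, K=53, alpha_hat=18.0, eps=0.1, zeta=0.05, tau=0.1))
```

Because $\mathcal{F}(x; K, \nu)$ decreases as $\nu$ grows, the bound becomes looser (larger) when $\epsilon$, $\zeta$ or $\tau$ increases, and tighter as the observed statistic $d_i$ increases.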

The proof of this theorem is given in Appendix D. We find the p-value
upper bounds at every location and use the BH procedure to perform anomaly
detection. The performance of this procedure depends on the values of $K$,
$\{\alpha_i\}$, $\tau$ and $\epsilon$. The parameter $\epsilon$ is a measure of the accuracy with which the
projection matrix $A$ preserves the distances between any two vectors in $\mathbb{R}^N$. A
value of $\epsilon$ close to zero implies that the distances are preserved fairly accurately.
When $\{\alpha_i\}$ are unknown and estimated from the observations, the performance
depends on the accuracy of the estimation procedure, which is reflected in our
bounds in (20) through $\zeta$.

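One can easily estimate $\{\alpha_i\}$ from $\{y_i\}$ for some choices of $A$. For instance,
if the entries of the projection matrix $A$ are drawn from $\mathcal{N}(0, 1/K)$, the $\{\alpha_i\}$ can
be estimated using a maximum likelihood estimator (MLE) by exploiting the
statistics of the projection matrix and noise. Note that the $j$-th element of the $i$-th
measured spectrum is

$$y_{i,j} = \alpha_i \sum_{k=1}^{N} A_{j,k} f^*_{i,k} + n_{i,j} \sim \mathcal{N}\left(0,\; \frac{\alpha_i^2}{K} \sum_{k=1}^{N} \left(f^*_{i,k}\right)^2 + 1\right)$$

for $j \in \{1, \ldots, K\}$. Since $\|f_i^*\|_2 = 1$ according to our problem formulation,
$y_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(0, \alpha_i^2/K + 1\right)$. The MLE of $\alpha_i$, given by $\hat{\alpha}_i = \arg\max_{\alpha} \mathbb{P}(y_i \mid A, \alpha)$, then
reduces to

$$\hat{\alpha}_i = \sqrt{\|y_i\|_2^2 - K}. \tag{21}$$

In practice, we use $\hat{\alpha}_i = \sqrt{\left(\|y_i\|_2^2 - K\right)_+}$, where $(a)_+ = a$ if $a \ge 0$ and $0$
otherwise, to ensure that the argument of the square root is nonnegative. We can use
concentration inequalities to show that with high probability, $\|y_i\|_2^2$ is tightly
concentrated around its mean $\mathbb{E}\left[\|y_i\|_2^2\right] = \alpha_i^2 + K$. Since $y_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(0, \alpha_i^2/K + 1\right)$,
$\left(\alpha_i^2/K + 1\right)^{-1} \|y_i\|_2^2 \sim \chi^2_K$. From Lemma 2.2 in [52], and Proposition 1 and Remark 1
in [51], for any $t > 0$,

$$\mathbb{P}\left(\left|\, \|y_i\|_2^2 - (\alpha_i^2 + K) \right| > t\right) \le C \exp\left(-c t^2\right) \tag{22}$$

for some absolute constants $C, c > 0$. This result shows that with high probability,
$\|y_i\|_2^2 - K$ is nonnegative.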
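The estimator in (21), with the positive-part safeguard, can be sketched as follows. The helper names are hypothetical, and the simulation draws $y_i$ directly from the marginal $\mathcal{N}(0, \alpha_i^2/K + 1)$ model stated above rather than forming $A f_i^*$ explicitly, which is a simplifying assumption of this illustration.

```python
import math
import random


def estimate_alpha(y):
    """alpha_hat = sqrt((||y||^2 - K)_+), where K = len(y)."""
    energy = sum(v * v for v in y)
    return math.sqrt(max(energy - len(y), 0.0))


def simulate_observation(alpha, K, rng):
    """Draw y with i.i.d. N(0, alpha^2 / K + 1) entries."""
    sd = math.sqrt(alpha * alpha / K + 1.0)
    return [rng.gauss(0.0, sd) for _ in range(K)]


rng = random.Random(0)
K, alpha = 53, 18.0
estimates = [estimate_alpha(simulate_observation(alpha, K, rng)) for _ in range(5)]
print([round(a, 2) for a in estimates])
```

A single estimate fluctuates around the true $\alpha_i$ because $\|y_i\|_2^2$ is a scaled $\chi^2_K$ variable; the relative spread shrinks as $K$ grows, which is the behavior summarized by $\zeta$ in (19).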

The experimental results discussed in Sec. 6 demonstrate the performance
of this detector as a function of $K$, $\{\alpha_i\}$ and $\tau$ when $\{\alpha_i\}$ are known and as a
function of $K$, $\tau$ and $\zeta$ when $\{\alpha_i\}$ are estimated.

6 Experimental Results 

In the experiments that follow, the entries of $A$ are drawn from $\mathcal{N}(0, 1/K)$.



6.1 Dictionary signal detection 

To test the effectiveness of our approach, we formed a dictionary $\mathcal{D}$ of nine spectra
(corresponding to different kinds of trees, grass, water bodies and roads)
obtained from a labeled HyMap (Hyperspectral Mapper) remote sensing data
set [32], and simulated a realistic dataset using the spectra from this dictionary.
Each HyMap spectrum is of length $N = 106$. We generated projection measurements
of these data such that $z_i = \Phi(\alpha_i f_i^* + b_i) + w_i$ according to (1),
where $w_i \sim \mathcal{N}(0, \sigma^2 I)$, $f_i^* \in \mathcal{D}$ for $i = 1, \ldots, 8100$, $b_i \sim \mathcal{N}(\mu_b, \Sigma_b)$ such that
$\Sigma_b$ satisfies the condition in (4), and $\alpha_i = \alpha_i^* \sqrt{K}$ where $\alpha_i^* \sim \mathcal{U}[21, 25]$ and $\mathcal{U}$
denotes the uniform distribution. We let $\sigma^2 = 5$ and model $\{\alpha_i\}$ to be proportional
to $\sqrt{K}$ to account for the fact that the total observed signal energy increases as
the number of detectors increases. We transform the $z_i$ by a series of operations
to arrive at a model of the form discussed in (2), which is $y_i = \alpha_i A f_i^* + n_i$.
For this dataset, $p_{\min} = 0.04938$, $p_{\max} = 0.1481$, and $d_{\min} = 0.04341$.

We evaluate the performance of our detector (7) on the transformed observations,
relative to the number of measurements $K$, by comparing the detection
results to the ground truth. Our MAP detector returns a label $L_i^{\mathrm{MAP}}$ for every
observed spectrum, which is determined according to

$$L_i^{\mathrm{MAP}} = \arg\min_{\ell \in \{1, \ldots, m\},\; f^{(\ell)} \in \mathcal{D}} \left( \frac{1}{2} \left\| y_i - \alpha_i A f^{(\ell)} \right\|^2 - \log p^{(\ell)} \right)$$

where $m$ is the number of signals in $\mathcal{D}$, and $p^{(\ell)}$ is the a priori probability of
target class $\ell$. In our experiments we evaluate the performance of our classifier
when (a) $\{\alpha_i\}$ are known (AK) and (b) $\{\alpha_i\}$ are unknown (AU) and must be
estimated from $y$. The empirical $\mathrm{pFDR}^{(j)}$ for each target spectrum
$j$ is calculated as follows:

$$\mathrm{pFDR}^{(j)} = \frac{\sum_{i=1}^{M} \mathbb{1}\left\{L_i^{\mathrm{GT}} = j,\; L_i^{\mathrm{MAP}} \neq j\right\}}{\sum_{i=1}^{M} \mathbb{1}\left\{L_i^{\mathrm{MAP}} \neq j\right\}}$$

where $\{L_i^{\mathrm{GT}}\}$ denote the ground truth labels. The empirical $\mathrm{pFDR}^{(j)}$ is the
ratio of the number of missed targets to the total number of signals that were
declared to be nontargets. The plots in Fig. 1(a) show the results obtained using
our target detection approach under the AK case (shown by a dark gray dashed
line) and the AU case (shown by a light gray dashed line), compared to the
theoretical upper bound (shown by a solid line). These results are obtained by
averaging the pFDR values obtained over 1000 different noise, sensing matrix
and background realizations. Note that the theoretical results only apply to the
AK case since they were derived under the assumption of $\{\alpha_i\}$ being known.
The experimental results are shown for both AK and AU cases to provide a
comparison between the two scenarios. In both these cases, the worst-case
empirical pFDR curves decay as $K$ increases. In the AK
case, in particular, the worst-case empirical pFDR curve decays at the same rate
as the upper bound. In this experiment, for a fixed $\alpha_{\min}$ and $d_{\min}$, we chose
$K$ to satisfy (13c). The theory is somewhat conservative, and in practice the
method works well even when the values of $K$ are below the bound in (13c).
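The MAP labeling rule and the empirical pFDR described in this paragraph can be sketched as below. The function names and the two-element toy dictionary are assumptions for illustration, and returning 0 when no signal is declared a nontarget is a convention adopted here (the pFDR itself is conditioned on at least one rejection).

```python
import math


def map_label(y, alpha, A, dictionary, priors):
    """Index l minimizing 0.5 * ||y - alpha * A f^(l)||^2 - log p^(l)."""
    best, best_cost = None, float("inf")
    for l, (f, p) in enumerate(zip(dictionary, priors)):
        Af = [sum(a * b for a, b in zip(row, f)) for row in A]
        cost = 0.5 * sum((yi - alpha * v) ** 2 for yi, v in zip(y, Af)) - math.log(p)
        if cost < best_cost:
            best, best_cost = l, cost
    return best


def empirical_pfdr(map_labels, true_labels, j):
    """Missed targets of class j divided by signals declared not to be class j."""
    declared_non = [t for m, t in zip(map_labels, true_labels) if m != j]
    if not declared_non:
        return 0.0  # convention used in this sketch when nothing is rejected
    return sum(1 for t in declared_non if t == j) / len(declared_non)


# Toy example: identity "sensing matrix" and a two-element dictionary.
A = [[1.0, 0.0], [0.0, 1.0]]
D = [[1.0, 0.0], [0.0, 1.0]]
print(map_label([1.9, 0.2], 2.0, A, D, [0.5, 0.5]))          # prints 0
print(empirical_pfdr([0, 1, 1, 0, 1], [0, 0, 1, 0, 1], j=0))  # one miss among three
```

When two dictionary elements fit an observation equally well, the $-\log p^{(\ell)}$ term breaks the tie in favor of the more probable class.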

In the experiment that follows, we let $\alpha_i^* \sim \mathcal{U}[10, 20]$, where $\mathcal{U}$ denotes a
uniform random variable, $\alpha_i = \sqrt{K} \alpha_i^*$, and evaluate the performance of our
detector for different values of $K$ that are not necessarily chosen to satisfy
(13c). In addition, we also compare the performance of our detection method
to that of a MAP-based target detector operating on downsampled versions of
our simulated spectral input image. The reason behind such a comparison is to
show what kinds of measurements yield better results given a fixed number of
detectors.

Figure 1: Compressive target detection results under the AK ($\{\alpha_i\}$ known) and
AU ($\{\alpha_i\}$ unknown) cases respectively as a function of $K$. (a) Comparison of
the worst-case empirical pFDR curves with the theoretical bounds when the SNR
is high. (b) Comparison of the results obtained by the proposed method using
projection measurements with $\Phi$ designed according to (24), $\Phi$ chosen at random,
and the ones using downsampled measurements (DM) when the SNR is low.

For an input spectrum $g \in \mathbb{R}^N$, we let $\tilde{g} \in \mathbb{R}^K$ denote its downsampled
approximation. Specifically, the $j$-th element of $\tilde{g}$ aggregates $r = \lceil N/K \rceil$
consecutive elements of $g$. Let us consider making observations of the form

$$\tilde{y}_i = \frac{\tilde{s}_i}{c} + n_i \in \mathbb{R}^K \tag{23}$$

where $\tilde{s}_i = \alpha_i \tilde{f}_i^* + \tilde{b}_i$ is the $K$-dimensional downsampled version of $\alpha_i f_i^* + b_i$ for
$K \le N$, $n_i \sim \mathcal{N}(0, \sigma^2 I)$ for $\sigma^2 = 5$, and $c$ is a constant that is chosen to preserve
the mean signal-to-noise ratio corresponding to the downsampled and projection
measurements. The MAP-based detector operating on the downsampled data
returns a label $D_i^{\mathrm{MAP}}$ for every observed spectrum, which is determined according
to

$$D_i^{\mathrm{MAP}} = \arg\min_{\ell \in \{1, \ldots, m\},\; f^{(\ell)} \in \mathcal{D}} \left(\tilde{y}_i - \alpha_i \tilde{f}^{(\ell)}\right)^T G^{-1} \left(\tilde{y}_i - \alpha_i \tilde{f}^{(\ell)}\right) - \log p^{(\ell)}$$

where $G = \tilde{\Sigma}_b + \sigma^2 I$, $\tilde{\Sigma}_b$ is the covariance matrix obtained from the downsampled
versions of the background training data, and $\tilde{f}^{(\ell)}$ is the downsampled
version of $f^{(\ell)} \in \mathcal{D}$. The algorithm declares that target spectrum $f^{(j)} \in \mathcal{D}$ is
present in the $i$-th location if $D_i^{\mathrm{MAP}} = j$. In order to illustrate the advantages
of using a $\Phi$ designed according to (24), we compare the performance of the
proposed detector when $\Phi$ is chosen to be a random Gaussian matrix
whose entries are drawn from $\mathcal{N}(0, 1/K)$ and when $\Phi$ is chosen according to
(24). Fig. 1(b) shows a comparison of the results obtained using the projection
measurements obtained using $\Phi$ designed according to (24), $\Phi$ chosen at random,
and the downsampled measurements under the AK case. These results
show that the detection algorithm operating on projection measurements using
$\Phi$ designed using background and sensor noise statistics yields significantly
better results than the one operating on the downsampled data, and that the
empirical pFDR values in our method decay with $K$. The improvement in performance
using projection measurements comes from the distance-preservation
property of the projection operator $A$. While a Gaussian sensing matrix $A$
preserves distances between any pair of vectors from a finite collection of vectors
with high probability [1,4], downsampling loses some of the fine differences
between similar-looking spectra in the dictionary. Furthermore, when $\Phi$
is chosen at random, the resulting whitened transformation matrix is not necessarily
distance-preserving. This may worsen the performance as illustrated in
Fig. 1(b).

6.2 Anomaly detection 

In this section, we evaluate the performance of our anomaly detection method on
(a) a simulated dataset, where we compare the results obtained using
the proposed projection measurements with the ones obtained using downsampled
measurements, and (b) a real AVIRIS (Airborne Visible InfraRed Imaging
Spectrometer) dataset.

6.2.1 Experiments on simulated data 

We simulate a spectral image $f^*$ composed of 8100 spectra, where each of them
is either drawn from a dictionary $\mathcal{D} = \{f^{(1)}, \ldots, f^{(5)}\}$ consisting of five labeled
spectra from the HyMap data that correspond to a natural landscape (trees,
grass and lakes) or is anomalous. The anomalous spectrum is extracted from
unlabeled AVIRIS data, and the minimum distance between the anomalous
spectrum $f^{(a)}$ and any of the spectra in $\mathcal{D}$ is $d_{\min} = \min_{f \in \mathcal{D}} \|f - f^{(a)}\| = 0.5308$.
The simulated data has 625 locations that contain the anomalous spectrum.
Our goal is to find the spatial locations that contain the anomalous AVIRIS
spectrum given noisy measurements of the form $z_i = \Phi(\alpha_i f_i^* + b_i) + w_i$, where
$b_i \sim \mathcal{N}(\mu_b, \Sigma_b)$, $\Phi$ is designed according to (24), $w_i \sim \mathcal{N}(0, \sigma^2 I)$ and $f_i^* \in \mathcal{D}$
under $\mathcal{H}_{0i}$. As discussed in Sec. 5, $f_i^*$ is anomalous under $\mathcal{H}_{1i}$, and our goal is
to control the FDR below a user-specified false discovery level $\delta$. We simulate
$\alpha_i = \sqrt{K} \alpha_i^*$ where $\alpha_i^* \sim \mathcal{U}[2, 3]$. In this experiment we assume the availability
of background training data to estimate the background statistics and the sensor
noise variance $\sigma^2$. Given the knowledge of the background statistics, we perform
the whitening transformation discussed in Sec. 2 and evaluate the detection
performance on the preprocessed observations given by (2).

For a fixed $\tau = 0.1$ and $\epsilon = 0.1$, we evaluate the performance of the detector
as the number of measurements $K$ increases under the AK and AU cases respectively,
by comparing the pseudo-ROC (receiver operating characteristic) curves
obtained by plotting the empirical false discovery rate against $1 - \mathrm{FNR}$, where
FNR is the false nondiscovery rate. Note that $1 - \mathrm{FNR}$ is the expected ratio
of the number of null hypotheses that are correctly accepted to the number of
declared null hypotheses. The empirical FDR and FNR are computed according
to

$$\mathrm{FDR} = \frac{\sum_{i=1}^{M} \mathbb{1}\left\{p_i \le p_t,\; L_i^{\mathrm{GT}} = 0\right\}}{\sum_{i=1}^{M} \mathbb{1}\left\{p_i \le p_t\right\}} \quad \text{and} \quad \mathrm{FNR} = \frac{\sum_{i=1}^{M} \mathbb{1}\left\{p_i > p_t,\; L_i^{\mathrm{GT}} = 1\right\}}{\sum_{i=1}^{M} \mathbb{1}\left\{p_i > p_t\right\}}$$

where $p_t$ is the p-value threshold such that the BH procedure rejects all null
hypotheses for which $p_i \le p_t$, and the ground truth label $L_i^{\mathrm{GT}} = 0$ if the $i$-th
spectrum is not anomalous, and 1 otherwise. In this experiment, we consider
three different values of $K$, including $K \approx N/3$ and $K \approx N/2$, where
$N = 106$, and evaluate the performance of our detector for each $K$. Furthermore,
in our experiments with simulated data, we declare a spectrum to be
anomalous if $d_i > \eta$, where $\eta$ is a user-specified threshold and $d_i$ is defined in
(16). We use the p-value upper bound in (20) in our experiments with real data
where the ground truth is unknown.
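The empirical FDR and FNR described above amount to simple counting once the p-values, the threshold $p_t$ and the ground-truth labels are available. The sketch below uses a hypothetical function name and toy inputs, and returns 0 for a ratio whose denominator is zero, which is a convention assumed here.

```python
def empirical_fdr_fnr(p_values, truth, p_t):
    """truth[i] is 1 for an anomalous spectrum and 0 otherwise."""
    declared = [p <= p_t for p in p_values]  # True means "declared anomalous"
    n_rejected = sum(declared)
    n_accepted = len(declared) - n_rejected
    false_disc = sum(1 for d, g in zip(declared, truth) if d and g == 0)
    false_nondisc = sum(1 for d, g in zip(declared, truth) if not d and g == 1)
    fdr = false_disc / n_rejected if n_rejected else 0.0
    fnr = false_nondisc / n_accepted if n_accepted else 0.0
    return fdr, fnr


# Toy example with five locations and threshold p_t = 0.05.
fdr, fnr = empirical_fdr_fnr([0.001, 0.2, 0.03, 0.5, 0.004], [1, 0, 0, 1, 1], 0.05)
print(fdr, fnr)
```

In the toy example, three locations are declared anomalous and one of them is not, while one of the two accepted locations is actually anomalous. Sweeping $p_t$ (or the threshold $\eta$ on $d_i$) and plotting FDR against $1 - \mathrm{FNR}$ traces out a pseudo-ROC curve.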

We compare the performance of our method to a generalized likelihood ratio
test (GLRT)-based procedure operating on downsampled data, where we collect
measurements of the form in (23) and $f_i^* \in \mathcal{D}$ under $\mathcal{H}_{0i}$. Observe that

$$\tilde{y}_i \mid \mathcal{H}_{0i} \sim \sum_{f \in \mathcal{D}} \mathbb{P}\left(f_i^* = f\right) \mathcal{N}\left(\alpha_i \tilde{f},\; \tilde{\Sigma}_b + \sigma^2 I\right),$$

where $\tilde{f}$ refers to the downsampled version
of $f \in \mathcal{D}$. In this experiment we assume that each spectrum in $\mathcal{D}$ is equally
likely under $\mathcal{H}_{0i}$ for $i = 1, \ldots, M$. The GLRT-based approach declares the $i$-th
spectrum to be anomalous if

$$-\log \mathbb{P}\left(\tilde{y}_i \mid \mathcal{H}_{0i}\right) \underset{\mathcal{H}_{0i}}{\overset{\mathcal{H}_{1i}}{\gtrless}} \eta$$

for $i = 1, \ldots, M$, where $\eta$ is a user-specified threshold [45]. While our anomaly
detection method is designed to control the FDR below a user-specified threshold,
the GLRT-based method is designed to increase the probability of detection
while keeping the probability of false alarm as low as possible. To facilitate a
fair evaluation of these methods, we compare the pseudo-ROC curves (FDR
versus $1 - \mathrm{FNR}$) and the actual ROC curves (probability of false alarm $p_f$ versus
probability of detection $p_d$) corresponding to these methods obtained by
averaging the empirical FDR, FNR, $p_d$ and $p_f$ over 1000 different noise and
sensing matrix realizations for different values of $K$. We also compare the
performance of the proposed method when $\Phi$ is chosen according to (24) and when
it is chosen at random, as discussed in the previous section. Figs. 2(a) and 2(e)
show the pseudo-ROC plots and the conventional ROC plots obtained using the
GLRT-based method operating on downsampled data when $\{\alpha_i\}$ are known.
Figs. 2(b) and 2(f) show the results obtained by using a random Gaussian $\Phi$
instead of the $\Phi$ in (24). Figs. 2(c) and 2(g) show the pseudo-ROC plots and
the conventional ROC plots obtained using our method when $\{\alpha_i\}$ are known.
These plots show that performing anomaly detection from our designed projection
measurements yields better results than performing anomaly detection
on downsampled measurements and on measurements obtained using a random
Gaussian $\Phi$. This is largely due to the fact that carefully chosen projection
measurements preserve distances (up to a constant factor) among pairs of vectors
in a finite collection, whereas the downsampled measurements fail to preserve
distances among vectors that are very similar to each other. Similarly, a random
projection matrix $\Phi$ is not necessarily distance-preserving after the whitening
transformation, which leads to poor performance as illustrated in Figs. 2(b) and
2(f). Figs. 2(d) and 2(h) show the pseudo-ROC plots and the conventional ROC
plots obtained using our method when $\{\alpha_i\}$ are unknown and are estimated
from the measurements. Note that the value of $\zeta$ decreases as $K$ increases since
the estimation accuracy of $\{\alpha_i\}$ increases with $K$. These plots show
that the performance improves as we collect more observations, and that, as
expected, the performance under the AK case is better than the performance
under the AU case.



6.2.2 Experiments on real AVIRIS data 

To test the performance of our anomaly detector on a real dataset, we consider
the unlabeled AVIRIS Jasper Ridge dataset $g \in \mathbb{R}^{614 \times 512 \times 197}$, which is
publicly available from the NASA AVIRIS website,
http://aviris.jpl.nasa.gov/html/aviris.freedata.html. We split this data spatially to form equi-sized
training and validation datasets, $g^t$ and $g^v$ respectively, each of which is
of size $128 \times 128 \times 197$. Figs. 3(a) and 3(b) show images of the AVIRIS training
and validation data summed through the spectral coordinates. The training
data consist of a rocky terrain with a small patch of trees. The validation
data seem to consist of a similar rocky terrain, but also contain an anomalous
lake-like structure. The goal is to evaluate the performance of the detector in
detecting the anomalous region in the validation data for different values of $K$.
We cluster the spectral targets in the normalized training data into eight different
clusters using the K-means clustering algorithm and form a dictionary $\mathcal{D}$ comprising
the cluster centroids. Given the dictionary and the validation data,
we find the ground truth by labeling the $i$-th validation spectrum as anomalous
if $\min_{f \in \mathcal{D}} \|f - g_i^v\| > \tau$. Since the statistics of the possible background
contamination in the data could not be learned in this experiment because of the
lack of labeled training data, the dictionary might be background contaminated
as well. The parameter $\tau$ encapsulates this uncertainty in our knowledge of the
dictionary. In this experiment, we set $\tau = 0.2$.
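The ground-truth labeling rule used here, which marks a validation spectrum as anomalous when its distance to every dictionary element exceeds $\tau$, can be sketched as follows. The function name and the toy two-dimensional "spectra" are assumptions for illustration; the dictionary is taken as given, so the K-means step is not shown.

```python
import math


def label_anomalies(spectra, dictionary, tau):
    """Return 1 where min_f ||f - g|| > tau (anomalous), else 0."""
    labels = []
    for g in spectra:
        d_min = min(math.dist(f, g) for f in dictionary)
        labels.append(1 if d_min > tau else 0)
    return labels


# Toy unit-norm "spectra" in two dimensions, threshold tau = 0.2.
dictionary = [[1.0, 0.0], [0.0, 1.0]]
spectra = [[1.0, 0.0], [0.6, 0.8], [0.99, 0.14]]
print(label_anomalies(spectra, dictionary, tau=0.2))  # prints [0, 1, 0]
```

Raising $\tau$ tolerates larger deviations from the dictionary before a spectrum counts as anomalous, which is how the uncertainty about the dictionary enters the labeling.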

We generate measurements of the form $y_i = \sqrt{K} A g_i^v + n_i$ for $i = 1, \ldots, 128 \times
128$, where $n_i \sim \mathcal{N}(0, I)$. The $\sqrt{K}$ factor indicates that the observed signal
strength increases with $K$. For a fixed FDR control value of 0.01, Figs. 3(c)
and 3(d) show the results obtained for two values of $K$, the second of which is $K \approx N/2$.
Fig. 3(c) shows how the probability of error decays as a function of the number
of measurements $K$. The results presented here are obtained by averaging over
1000 different noise and sensing matrix realizations. From these results, we can
see that the number of detected anomalies increases with $K$ and the number of
misclassifications decreases with $K$.



7 Conclusion 

This work presents computationally efficient approaches for detecting known 
targets and anomalies of different strengths from projection measurements with- 
out performing a complete reconstruction of the underlying signals, and offers 
theoretical bounds on the worst-case target detector performance. This paper 
treats each signal as independent of its spatial or temporal neighbors. This 
assumption is reasonable in many contexts, especially when the spatial or tem- 
poral resolution is low relative to the spatial homogeneity of the environment or 
the pace with which a scene changes. However, emerging technologies in compu- 
tational optical systems continue to improve the resolution of spectral imagers. 
In our future work we will build upon the methods that we have discussed here 
to exploit the spatial or temporal correlations in the data. 



A Proof of Theorem 1 

Using linear algebra and matrix theory, it is possible to show that if $B =
I - A \Sigma_b A^T$ is positive definite, then

$$\Phi = \sigma B^{-1/2} A \tag{24}$$

satisfies (3).^ In particular, we can substitute (24) in (3) to verify that the
proposed construction of $\Phi$ satisfies (3). Observe that $C_\Phi = \left(\Phi \Sigma_b \Phi^T + \sigma^2 I\right)^{-1/2}$
can be written in terms of (24) as follows:

$$\begin{aligned}
C_\Phi &= \left( \sigma B^{-1/2} A \, \Sigma_b \left(\sigma B^{-1/2} A\right)^T + \sigma^2 I \right)^{-1/2} \\
&= \left( \sigma^2 B^{-1/2} \left(A \Sigma_b A^T\right) \left(B^{-1/2}\right)^T + \sigma^2 I \right)^{-1/2} \\
&= \left( \sigma^2 B^{-1/2} (I - B) \left(B^{-1/2}\right)^T + \sigma^2 I \right)^{-1/2} = \left(\sigma^2 B^{-1}\right)^{-1/2} = \sigma^{-1} B^{1/2}
\end{aligned} \tag{25}$$

where the third-to-last equation follows from the definition of $B$ and (25) follows
from the fact that $B$ is symmetric and positive definite. If $B$ is positive
definite, then $B^{-1}$ is positive definite as well and can be decomposed
as $B^{-1} = \left(B^{-1/2}\right)^T B^{-1/2}$, where the matrix square root $B^{-1/2}$ is symmetric
and positive definite. By substituting (25) and (24) in (3), we have
$C_\Phi \Phi = \sigma^{-1} B^{1/2} \sigma B^{-1/2} A = A$. A sufficient condition for $B$ to be positive
definite can be derived as follows.

^The authors would like to thank Prof. Roummel Marcia for fruitful discussions related to
this point.

To ensure positive definiteness of $B$, we must have

$$x^T B x = x^T x - x^T \left(A \Sigma_b A^T\right) x > 0 \tag{26}$$

for any nonzero $x \in \mathbb{R}^K$. Note that since $\Sigma_b$ is positive semidefinite,
$x^T \left(A \Sigma_b A^T\right) x \ge 0$. However, the right hand side of (26) is $> 0$ only if the
spectral norm of $A \Sigma_b A^T$ is $< 1$, since $x^T \left(A \Sigma_b A^T\right) x \le \|x\|^2 \cdot \|A \Sigma_b A^T\|$. The
norm of $A \Sigma_b A^T$ is in turn bounded above by

$$\|A \Sigma_b A^T\| \le \|A\| \|\Sigma_b\| \|A^T\| = \|A\|^2 \|\Sigma_b\| = \|A\|^2 \lambda_{\max},$$

since $\|A\| = \|A^T\|$ and $\|\Sigma_b\| = \lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of
$\Sigma_b$. To ensure $\|A \Sigma_b A^T\| < 1$, it suffices that $\|A\|^2 \lambda_{\max} < 1$, which leads to the
result of Theorem 1.



B Proof of Theorem 2 



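The proof of Theorem 2 adapts the proof techniques from [48] to nonidentical
independent hypothesis tests. We begin by expanding the pFDR definition in
(8) as follows:

$$\mathrm{pFDR}^{(j)}(\Gamma) = \sum_{k=1}^{M} \mathbb{E}\left[ \frac{V(\Gamma)}{R(\Gamma)} \,\middle|\, R(\Gamma) = k \right] \mathbb{P}\left(R(\Gamma) = k \mid R(\Gamma) > 0\right).$$

Observe that $R(\Gamma) = k$ implies that there exists some subset $S_k =
\{u_1, \ldots, u_k\} \subseteq \{1, \ldots, M\}$ of size $k$ such that $y_{u_\ell} \in \Gamma_{u_\ell}$ for $\ell = 1, \ldots, k$ and $y_i \notin \Gamma_i$
for all $i \notin S_k$. To simplify the notation, let $\Lambda_{S_k} = \prod_{u \in S_k} \Gamma_u \times \prod_{i \notin S_k} \Gamma_i^c$,
where $\Gamma_i^c$ is the complement of $\Gamma_i$, denote the significance region that corresponds
to the set $S_k$, and let $T = (y_1, \ldots, y_M)$ be the set of test statistics corresponding
to each hypothesis test. Considering all such subsets, we have

$$\mathrm{pFDR}^{(j)}(\Gamma) = \sum_{k=1}^{M} \sum_{S_k} \frac{1}{k} \, \mathbb{E}\left[ V(\Gamma) \mid T \in \Lambda_{S_k} \right] \times \mathbb{P}\left(T \in \Lambda_{S_k} \mid R(\Gamma) > 0\right). \tag{27}$$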



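By plugging in the definition of $V(\{\Gamma_i\})$ from (9), we have

$$\mathbb{E}\left[ V(\Gamma) \mid T \in \Lambda_{S_k} \right] = \mathbb{E}\left[ \sum_{i=1}^{M} \mathbb{1}\{y_i \in \Gamma_i\} \, \mathbb{1}\{\mathcal{H}_{0i}^{(j)}\} \,\middle|\, T \in \Lambda_{S_k} \right] = \sum_{\ell=1}^{k} \mathbb{P}\left( \mathcal{H}_{0u_\ell}^{(j)} \mid y_{u_\ell} \in \Gamma_{u_\ell} \right) \tag{28}$$

for all $u_\ell \in S_k$ since the tests are independent of each other given $A$. The
posterior probability $\mathbb{P}\left(\mathcal{H}_{0i}^{(j)} \mid y_i \in \Gamma_i\right)$ for the $i$-th hypothesis test can be
expanded using Bayes' rule as

$$\mathbb{P}\left(\mathcal{H}_{0i}^{(j)} \mid y_i \in \Gamma_i\right) = \frac{\mathbb{P}\left(\hat{f}_i \neq f^{(j)} \mid f_i^* = f^{(j)}\right) \mathbb{P}\left(f_i^* = f^{(j)}\right)}{\mathbb{P}\left(\hat{f}_i \neq f^{(j)}\right)} \tag{29}$$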



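where $\hat{f}_i = \arg\max_{f^{(\ell)} \in \mathcal{D}} \mathbb{P}\left(f_i^* = f^{(\ell)} \mid y_i, \alpha_i, A\right)$. To upper bound the
numerator of (29), consider the probability of misclassification given by $(P_e)_i =
\mathbb{P}\left(\hat{f}_i \neq f_i^*\right)$ where $f_i^* \in \mathcal{D}$, which can be expanded as follows:

$$(P_e)_i = \sum_{\ell=1}^{m} \mathbb{P}\left(\hat{f}_i \neq f^{(\ell)} \mid f_i^* = f^{(\ell)}\right) \mathbb{P}\left(f_i^* = f^{(\ell)}\right) \ge \mathbb{P}\left(\hat{f}_i \neq f^{(j)} \mid f_i^* = f^{(j)}\right) \mathbb{P}\left(f_i^* = f^{(j)}\right). \tag{30}$$

The denominator term in (29) can be expanded as follows:

$$\mathbb{P}\left(\hat{f}_i \neq f^{(j)}\right) = \mathbb{P}\left(\hat{f}_i \neq f^{(j)} \mid f_i^* = f^{(j)}\right) \mathbb{P}\left(f_i^* = f^{(j)}\right) + \mathbb{P}\left(\hat{f}_i \neq f^{(j)} \mid f_i^* \neq f^{(j)}\right) \mathbb{P}\left(f_i^* \neq f^{(j)}\right).$$

Observe that $\mathbb{P}\left(\hat{f}_i \neq f^{(j)} \mid f_i^* = f^{(j)}\right)$ is nonnegative, and

$$\mathbb{P}\left(\hat{f}_i \neq f^{(j)} \mid f_i^* \neq f^{(j)}\right) = 1 - \mathbb{P}\left(\hat{f}_i = f^{(j)} \mid f_i^* \neq f^{(j)}\right) = 1 - \frac{\mathbb{P}\left(\hat{f}_i = f^{(j)},\, f_i^* \neq f^{(j)}\right)}{\mathbb{P}\left(f_i^* \neq f^{(j)}\right)} \ge 1 - \frac{(P_e)_i}{1 - p^{(j)}}.$$

Thus

$$\mathbb{P}\left(\hat{f}_i \neq f^{(j)}\right) \ge \left(1 - \frac{(P_e)_i}{1 - p^{(j)}}\right) \left(1 - p^{(j)}\right) = 1 - p^{(j)} - (P_e)_i. \tag{31}$$

Substituting (30) and (31) in (29),

$$\mathbb{P}\left(\mathcal{H}_{0i}^{(j)} \mid y_i \in \Gamma_i\right) \le \frac{(P_e)_i}{1 - p^{(j)} - (P_e)_i} \le \frac{(P_e)_{\max}}{1 - p^{(j)} - (P_e)_{\max}}. \tag{32}$$

By substituting (32) in (27) and (28) we have:

$$\begin{aligned}
\mathrm{pFDR}^{(j)}(\Gamma) &\le \sum_{k=1}^{M} \sum_{S_k} \frac{1}{k} \left[ \frac{k \, (P_e)_{\max}}{1 - p^{(j)} - (P_e)_{\max}} \right] \times \mathbb{P}\left(T \in \Lambda_{S_k} \mid R(\Gamma) > 0\right) \\
&= \frac{(P_e)_{\max}}{1 - p^{(j)} - (P_e)_{\max}} \sum_{k=1}^{M} \sum_{S_k} \mathbb{P}\left(T \in \Lambda_{S_k} \mid R(\Gamma) > 0\right) \le \frac{(P_e)_{\max}}{1 - p^{(j)} - (P_e)_{\max}}
\end{aligned}$$

since $\sum_{k} \sum_{S_k} \mathbb{P}\left(T \in \Lambda_{S_k} \mid R(\Gamma) > 0\right) \le 1$. The result of Theorem 2 is obtained
by finding an upper bound on the worst-case pFDR given by

$$\mathrm{pFDR}_{\max} = \max_{j \in \{1, \ldots, m\}} \mathrm{pFDR}^{(j)}(\Gamma) \le \max_{j \in \{1, \ldots, m\}} \frac{(P_e)_{\max}}{1 - p^{(j)} - (P_e)_{\max}} = \frac{(P_e)_{\max}}{1 - p_{\max} - (P_e)_{\max}}$$

where $p_{\max} = \max_{\ell \in \{1, \ldots, m\}} p^{(\ell)}$.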
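As a quick numerical illustration of the final bound $(P_e)_{\max} / (1 - p_{\max} - (P_e)_{\max})$, the sketch below evaluates it for a given misclassification probability and largest prior. The function name is hypothetical, and the value $(P_e)_{\max} = 0.01$ is an arbitrary choice; $p_{\max} = 0.1481$ is the value reported for the simulated dataset in Sec. 6.1.

```python
def pfdr_upper_bound(pe_max, p_max):
    """Evaluate (P_e)_max / (1 - p_max - (P_e)_max); needs pe_max < 1 - p_max."""
    denom = 1.0 - p_max - pe_max
    if denom <= 0.0:
        raise ValueError("bound is only meaningful when pe_max < 1 - p_max")
    return pe_max / denom


print(pfdr_upper_bound(0.01, 0.1481))
```

The bound is increasing in both arguments, so it is close to $(P_e)_{\max}$ itself when the misclassification probability is small and no single class dominates the prior.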



C Proof of Theorem 3 

The proof is via a random selection technique, similar to random coding arguments
common in information theory. Specifically, we will draw a $K \times N$
sensing matrix $A$ at random from a particular distribution and then show that,
for $\epsilon$, $N$, and $K$ satisfying the conditions of the theorem, the probability that
the conclusions of the theorem will fail to hold for this randomly chosen $A$ is
strictly smaller than unity. This will imply that the conclusions of the theorem
must be true for at least one (deterministic) realization of $A$.
We begin by specifying all the relevant random variables:

• $f_1^*, \ldots, f_M^*$ are i.i.d. random variables taking values in the dictionary $\mathcal{D} =
\{f^{(1)}, \ldots, f^{(m)}\}$ with probabilities $p^{(j)} = \Pr\{f_i^* = f^{(j)}\}$, $j \in \{1, \ldots, m\}$;

• $n_1, \ldots, n_M \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$;

• $G$ is a random $K \times N$ matrix with i.i.d. $\mathcal{N}(0, 1)$ entries.

We assume that $\{f_i^*\}_{i=1}^{M}$, $\{n_i\}_{i=1}^{M}$, and $G$ are mutually independent, and we
will denote by $\mathbb{P}$ their joint probability distribution. Finally, we let $A = G/\sqrt{K}$
and consider the observation model

$$y_i = \alpha_i A f_i^* + n_i, \quad i \in \{1, \ldots, M\} \tag{33}$$

where $\alpha_1, \ldots, \alpha_M > 0$ are the given signal strengths.

We first consider the case when $\alpha_1 = \ldots = \alpha_M = \alpha$. Given $\epsilon$, $N$, and $K$, we
define the following two error events:

$$\mathcal{E}_1 = \left\{ \|G\| > (1 + \epsilon)\left(\sqrt{K} + \sqrt{N}\right) \right\}, \quad \text{and} \quad \mathcal{E}_2 = \left\{ \hat{f}_1 \neq f_1^* \right\},$$

where, for each $i \in \{1, \ldots, M\}$, $\hat{f}_i$ is defined according to (12). Note that,
since we have assumed that the $\alpha_i$'s are equal and all the pairs $(f_i^*, n_i)$, $i \in
\{1, \ldots, M\}$, are i.i.d.,

$$\mathbb{P}\left(\hat{f}_i \neq f_i^* \mid A\right) = \mathbb{P}\left(\mathcal{E}_2 \mid A\right), \quad \forall i \in \{1, \ldots, M\}. \tag{34}$$

We will now prove that 

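The union bound gives $\mathbb{P}(\mathcal{E}_1 \cup \mathcal{E}_2) \le \mathbb{P}(\mathcal{E}_1) + \mathbb{P}(\mathcal{E}_2)$. First, we bound $\mathbb{P}(\mathcal{E}_1)$.
To do that, we use the following concentration result for Gaussian random
matrices [13]: for any $t > 0$,

$$\Pr\left\{ \|G\| \ge \sqrt{K} + \sqrt{N} + t \right\} \le 2 e^{-t^2/2}.$$

Letting $t = \epsilon\left(\sqrt{K} + \sqrt{N}\right)$ and using the fact that $t^2 \ge (K + N)\epsilon^2$, we get

$$\mathbb{P}(\mathcal{E}_1) \le 2 \exp\left( -\frac{(K + N)\epsilon^2}{2} \right). \tag{36}$$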
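The arithmetic behind (36) can be checked numerically. The sketch below, with a hypothetical function name, evaluates $t = \epsilon(\sqrt{K} + \sqrt{N})$, the tail bound $2e^{-t^2/2}$, and the relaxed form in (36) obtained from $t^2 = \epsilon^2(K + N + 2\sqrt{KN}) \ge (K + N)\epsilon^2$. The values $K = 53$, $N = 106$ and $\epsilon = 0.1$ are assumptions chosen to match the scale of the experiments.

```python
import math


def norm_tail_bound(K, N, eps):
    """Return t, the bound 2*exp(-t^2/2), and the relaxed bound in (36)."""
    t = eps * (math.sqrt(K) + math.sqrt(N))
    exact = 2.0 * math.exp(-t * t / 2.0)
    relaxed = 2.0 * math.exp(-(K + N) * eps ** 2 / 2.0)
    return t, exact, relaxed


t, exact, relaxed = norm_tail_bound(K=53, N=106, eps=0.1)
print(f"t = {t:.4f}, 2exp(-t^2/2) = {exact:.4f}, bound (36) = {relaxed:.4f}")
```

The relaxed bound is never smaller than the one with $t^2$ itself; it is used because it depends on $K$ and $N$ only through their sum. For small dimensions and small $\epsilon$ it is far from tight, and it becomes informative as $(K + N)\epsilon^2$ grows.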

Next, we bound $\mathbb{P}(\mathcal{E}_2)$. To that end, we use the following result, which is a
straightforward extension of Theorem 1 in [21] to nonequiprobable dictionary
elements:

Lemma 1 (Compressive classification error). Consider the problem of classifying
a signal of interest $f^* \in \mathcal{D} = \{f^{(1)}, \ldots, f^{(m)}\}$ to one of $m$ known
target classes by making observations of the form $y = \alpha A f^* + n$ where
$n \sim \mathcal{N}(0, \sigma^2 I)$, given the knowledge of the dictionary $\mathcal{D}$, prior probabilities
$p^{(j)}$ for $j \in \{1, \ldots, m\}$, sensing matrix $A$, and the noise variance $\sigma^2$. If the
entries of $A$ are drawn i.i.d. from $\mathcal{N}(0, 1/K)$ independently of $f^*$ and $n$, and
the estimate $\hat{f}$ is obtained according to (12), then

where the probability is taken with respect to the distributions underlying f*, A, 
and n. 

Using the above lemma, we have 

_ K 

P(f2) < f 1 + ^-^"l ' . (37) 

Combining (36) and (37), we get (35).
Because of (13a), the right-hand side of (35) is less than $1 - \epsilon - p_{\max}$, which
is strictly positive by hypothesis. Thus, from the fact that

$$\mathbb{P}(\mathcal{E}_1 \cup \mathcal{E}_2) = \mathbb{E}\left[ \mathbb{P}(\mathcal{E}_1 \cup \mathcal{E}_2 \mid A) \right]$$

and from (34), it follows that there exists at least one deterministic choice of
the $K \times N$ sensing matrix $A^*$, such that:

$$\|A^*\| \le (1 + \epsilon)\left(1 + \sqrt{N/K}\right) \tag{38a}$$

(P.)„„(A.) <L:J^(^^■^y%,,^ f-^^i^) (38b) 



Pmin V 4if 

where, for a given choice of $A$, $(P_e)_{\max}(A)$ denotes the maximum probability
of error defined in Theorem 2.

Next, from (38a) and (13b) it follows that $A^*$ satisfies the conditions of
Theorem 1. Finally, we use (11) to bound the worst-case pFDR achievable
with $A^*$. First of all, we note that the function $U(x) = x / (1 - p_{\max} - x)$ is twice
differentiable and convex on the interval $[0, 1 - p_{\max})$. Therefore, for any $x \in
[0, 1 - p_{\max})$ and any $h > 0$ small enough so that $x + h \in [0, 1 - p_{\max})$, we have

$$U(x + h) \le U(x) + U'(x + h)\, h = U(x) + \frac{(1 - p_{\max})\, h}{(1 - p_{\max} - x - h)^2}. \tag{39}$$



Let us choose



Pmin V 4ii' 

Then from (13a) we have $x + h \le 1 - \epsilon - p_{\max} < 1 - p_{\max}$, and from (13c) we
have $x + h > 0$. Hence, using (39) and simplifying, we obtain the bound

pFDR_(A*)<^(i^(l + ^)"-^) + 

Fmin \ ^ i-'miii \ / Pmm J 

2(1 -Pmax) ( {K + N)e' 



This proves the theorem for the case ai = . . . = um = Oi. 

To handle the case when the a^'s are distinct, we simply let 

I = argmm ai 
ie{i,...,M} 

and replace the definition of the error event £2 with £'2 = {/i* ^ /»* }• Then 
the same argument goes through, except that instead of (34) we use the bound 

P(/i ^ n\A) < P(/,. ^ f*,\A) = P(f^|A), Vi ^ i* 
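which follows from the following argument. First of all, we can replace the
observation model with the equivalent model

$$\tilde{y}_i = A f_i^* + \tilde{n}_i, \qquad i \in \{1, \ldots, M\},$$

where $\tilde{y}_i = \frac{1}{\alpha_i} y_i$ and $\tilde{n}_i = \frac{1}{\alpha_i} n_i \sim \mathcal{N}\left(0, \frac{1}{\alpha_i^2} I\right)$. Secondly, from the fact that $\alpha_i \ge \alpha_{i^*} = \alpha_{\min}$
for any $i \ne i^*$, it follows that $\tilde{n}_{i^*}$ is equal in distribution to $\tilde{n}_i + n_i'$, where
$n_i' \sim \mathcal{N}\left(0, \left(\frac{1}{\alpha_{i^*}^2} - \frac{1}{\alpha_i^2}\right) I\right)$ is independent of $\tilde{n}_i$. This implies that the $i^*$th
observation is the noisiest, and the corresponding MAP estimate $\hat{f}_{i^*}$ has the
largest probability of error.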
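The distributional identity used in this argument can be checked by adding covariances of independent Gaussians. The following is a short sketch, assuming unit-variance white Gaussian noise $n_i \sim \mathcal{N}(0, I)$ (consistent with the noncentral $\chi^2$ statistics used in the proof of Theorem 4) and writing $\tilde{n}_i = n_i / \alpha_i$ for the rescaled noise and $n_i'$ for the independent extra term; both symbols are notation chosen here for illustration:

```latex
\operatorname{Cov}\bigl(\tilde{n}_i + n_i'\bigr)
  = \frac{1}{\alpha_i^2}\, I
  + \Bigl(\frac{1}{\alpha_{i^*}^2} - \frac{1}{\alpha_i^2}\Bigr) I
  = \frac{1}{\alpha_{i^*}^2}\, I
  = \operatorname{Cov}\bigl(\tilde{n}_{i^*}\bigr),
\qquad
\frac{1}{\alpha_{i^*}^2} - \frac{1}{\alpha_i^2} \;\ge\; 0
\ \text{ because } \alpha_i \ge \alpha_{i^*} .
```

Since both sides are zero-mean Gaussian vectors with the same covariance, they are equal in distribution.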



D Proof of Theorem 4 



We first prove this theorem assuming that $\{\alpha_i\}$ are known and later extend
to the case where $\{\hat{\alpha}_i\}$ are estimated from the observations. Let $f_i =
\arg\min_{f \in \mathcal{D}} \|f_i^* - f\|$. The p-value expression in (17) can be expanded as
follows:


d,, > di 



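$$= P\Bigl(\min_{f \in \mathcal{D}} \|\alpha_i A(f_i^* - f) + n\| \ge d_i \,\Big|\, H_{0i}\Bigr)$$

$$\le P\bigl(\|\alpha_i A(f_i^* - f_i) + n\| \ge d_i \mid H_{0i}\bigr) = P\bigl(\|\alpha_i A(f_i^* - f_i) + n\|^2 \ge d_i^2 \mid H_{0i}\bigr). \tag{40}$$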

Note that $\|\alpha_i A(f_i^* - f_i) + n\|^2$ is a noncentral $\chi^2$ random variable with $K$ degrees
of freedom and a noncentrality parameter $\nu_i = \|\alpha_i A(f_i^* - f_i)\|^2$. Thus (40)
can be written in terms of a noncentral $\chi^2$ CDF $\mathcal{F}(d_i^2; K, \nu_i)$ with parameter
$\nu_i$. The upper and lower bounds on $\nu_i$ can be obtained using the properties of
the projection matrix $A$. Applying (18), we see that

$$\alpha_i^2 (1 - \epsilon)^2 \|f_i^* - f_i\|^2 \le \nu_i \le \alpha_i^2 (1 + \epsilon)^2 \|f_i^* - f_i\|^2$$

with high probability. Thus,

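$$p_i \le 1 - P\bigl(\|\alpha_i A(f_i^* - f_i) + n\|^2 \le d_i^2 \mid H_{0i}\bigr) = 1 - \mathcal{F}(d_i^2; K, \nu_i) \tag{41}$$

$$\le 1 - \mathcal{F}\bigl(d_i^2; K, \alpha_i^2 (1 + \epsilon)^2 \|f_i^* - f_i\|^2\bigr) \le 1 - \mathcal{F}\bigl(d_i^2; K, \alpha_i^2 (1 + \epsilon)^2 \tau^2\bigr)$$

since $\|f_i^* - f\| \le \tau$ for some $f \in \mathcal{D}$ under $H_{0i}$, and hence $\|f_i^* - f_i\| \le \tau$ for the minimizer $f_i$.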
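The final bound in (41) is straightforward to evaluate numerically. The sketch below is an illustration only, not the authors' implementation: it assumes an even number of measurements $K$ (so that the central $\chi^2$ CDF has a closed form), evaluates the noncentral $\chi^2$ CDF as a truncated Poisson mixture of central CDFs, and uses hypothetical function and parameter names (`pvalue_upper_bound`, `d_sq`, `alpha`, `eps`, `tau`).

```python
import math


def chi2_cdf_even(x, k):
    """CDF of a central chi-square variable with an even number k of degrees of freedom."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("this sketch only handles even, positive k")
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 0.0
    # Closed form for even k: 1 - exp(-x/2) * sum_{j < k/2} (x/2)^j / j!
    for j in range(k // 2):
        total += term
        term *= half / (j + 1)
    return 1.0 - math.exp(-half) * total


def ncx2_cdf_even(x, k, nc, terms=200):
    """Noncentral chi-square CDF as a Poisson(nc/2) mixture of central CDFs.

    The series is truncated after `terms` components, which is adequate
    for moderate noncentrality (an assumption of this sketch).
    """
    weight = math.exp(-nc / 2.0)  # Poisson probability of j = 0
    cdf = 0.0
    for j in range(terms):
        cdf += weight * chi2_cdf_even(x, k + 2 * j)
        weight *= (nc / 2.0) / (j + 1)  # Poisson probability of j + 1
    return min(max(cdf, 0.0), 1.0)


def pvalue_upper_bound(d_sq, k, alpha, eps, tau):
    """Upper bound 1 - F(d^2; K, alpha^2 (1 + eps)^2 tau^2) from (41)."""
    nc_max = (alpha * (1.0 + eps) * tau) ** 2
    return 1.0 - ncx2_cdf_even(d_sq, k, nc_max)
```

For example, `pvalue_upper_bound(30.0, 10, 2.0, 0.1, 0.5)` evaluates the bound for a squared test statistic of 30 with $K = 10$. Because the noncentral $\chi^2$ CDF decreases as the noncentrality grows, a larger `tau` gives a larger (more conservative) bound.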

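When $\{\alpha_i\}$ are estimated from the observations such that $\{\hat{\alpha}_i\}$ satisfy (19),
we can write the p-value expression in (41) as follows:

$$p_i \le 1 - \mathcal{F}\bigl(d_i^2; K, \|A(\alpha_i f_i^* - \hat{\alpha}_i f_i)\|^2\bigr)
\le 1 - \mathcal{F}\Bigl(d_i^2; K, (1 + \epsilon)^2 \hat{\alpha}_i^2 \Bigl\|\frac{\alpha_i}{\hat{\alpha}_i} f_i^* - f_i\Bigr\|^2\Bigr) \tag{42}$$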
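where (42) is due to the distance preservation property of $A$ given in (18). Observe that
$\bigl\|\frac{\alpha_i}{\hat{\alpha}_i} f_i^* - f_i\bigr\|^2$ can be upper bounded as shown below:

$$\Bigl\|\frac{\alpha_i}{\hat{\alpha}_i} f_i^* - f_i\Bigr\|^2
\le \Bigl(\Bigl\|\Bigl(\frac{\alpha_i}{\hat{\alpha}_i} - 1\Bigr) f_i^*\Bigr\| + \|f_i^* - f_i\|\Bigr)^2
= \Bigl(\Bigl|\frac{\alpha_i}{\hat{\alpha}_i} - 1\Bigr| + \|f_i^* - f_i\|\Bigr)^2
\le \bigl(\zeta + \|f_i^* - f_i\|\bigr)^2$$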



where the third-to-last relation is due to the triangle inequality, the second-to-last
relation comes from the assumption that $\|f_i^*\| = 1$, and the last inequality
is due to (19). By applying this result to (42) and exploiting the fact that
$\|f_i^* - f\| \le \tau$ under $H_{0i}$ for some $f \in \mathcal{D}$, we have

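$$p_i \le 1 - \mathcal{F}\bigl(d_i^2; K, (1 + \epsilon)^2 \hat{\alpha}_i^2 (\zeta + \|f_i^* - f_i\|)^2\bigr) \le 1 - \mathcal{F}\bigl(d_i^2; K, (1 + \epsilon)^2 \hat{\alpha}_i^2 (\zeta + \tau)^2\bigr).$$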
Figure 2: Comparison of the performances of the proposed anomaly detector
using a random $\Phi$, the proposed anomaly detector using the designed $\Phi$ in
(24), and the GLRT-based method operating on downsampled data for different
values of $K$ when $\alpha^* \sim \mathcal{U}[2,3]$ and $\alpha_i = \alpha^* \sqrt{K}$.





(d) Anomalies detected (shown by white dots) for $K \approx N/2 = 99$.
(e) Plot of the probability of error $p_e$ for different values of $K$.



Figure 3: Anomaly detection results corresponding to real AVIRIS data for a fixed 
FDR control of 0.01. 
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